How to run voice user interviews
A practical guide on how to run voice user interviews: question design, participant sourcing, the recording itself, and what to do with what you hear.
Most guides treat the recording itself as the hard part of running voice interviews. It isn't. The hard part is everything around it: knowing what you're actually trying to learn, asking questions that don't prime the answer, and resisting the urge to skim a transcript when you should be listening to a clip.
This is a working playbook for how to run voice user interviews that produce decisions, not document libraries. It's the version we use internally at Talkful, written down, including the bits we've gotten wrong as well as the bits we've kept.
When voice beats text (and when it doesn't)
Voice is the right medium when you want to hear how someone hesitates, what they say before they catch themselves, and which story they tell about their week. Voice is the wrong medium when you want them to rank five prices or count how many times they used a feature. Use a survey for the second. Use a real conversation for the first.
We've made the longer case for voice over text elsewhere, with the data behind it. The short version: spoken responses run several times longer than typed ones on the same prompt, and they carry the false starts, sighs, and "wait, let me start over" moments where the actual product insight tends to live.
How to run voice user interviews, end to end
Seven steps, sequential. You can stretch any of them, but if you skip step 01 or step 05, the rest of the work compounds the error.
01 · Decide what you actually need to learn
Before you write a single question, write one sentence: what decision will I make differently if this study comes back the way I expect, versus the opposite way?
If you can't answer that, don't run the study. You're either looking for validation (which voice will give you, but only because you'll cherry-pick clips), or you're avoiding a decision someone else needs to make.
A good research question is small, specific, and falsifiable. "Why aren't users converting?" is not a research question. "What stops people who finish onboarding from inviting a teammate in their first session?" is.
02 · Write fewer, better questions
Most first-time interview scripts have twelve questions. The good ones have four to six. Talkful caps studies at eight for a reason: by the eighth question, your participant is tired and the data is thin.
Three rules for the questions you do keep:
- Open, not closed. As Maria Rosala writes for Nielsen Norman Group, open-ended questions deliver deeper qualitative insight than closed ones. "What was the last thing that frustrated you about X?" beats "Did you find X frustrating?"
- Concrete, not abstract. "Walk me through the last time you used X" beats "How do you usually use X?" Memory for general patterns is unreliable. Memory for specific incidents is good.
- No leading constructions. Avoid "because" and "even though" in the question stem. As Rosala also notes, "when people are presented with leading questions, they're more likely to agree with the question or succumb to some kind of priming effect."
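One way to keep yourself honest is to lint the draft script before anyone records. The sketch below is a rough heuristic we'd reach for, not a Talkful feature: the word lists and the lintQuestion helper are illustrative, and they only catch the obvious offenders (rule two, concrete over abstract, still needs a human read).

```typescript
// Heuristic lint for interview question stems. Hypothetical helper, not a product feature.
type Warning = { question: string; issue: string };

const CLOSED_STARTERS = ["did you", "do you", "would you", "is it", "was it"];
const LEADING_WORDS = ["because", "even though"];

function lintQuestion(question: string): Warning[] {
  const q = question.trim().toLowerCase();
  const warnings: Warning[] = [];

  // Rule one: open, not closed. Closed stems invite a yes/no answer.
  if (CLOSED_STARTERS.some((s) => q.startsWith(s))) {
    warnings.push({ question, issue: "closed stem: invites a yes/no answer" });
  }

  // Rule three: no leading constructions. "Because" and "even though" smuggle assumptions in.
  for (const w of LEADING_WORDS) {
    if (q.includes(w)) {
      warnings.push({ question, issue: `leading construction: "${w}"` });
    }
  }

  // Rule two (concrete, not abstract) resists a lexical check; review it by hand.
  return warnings;
}

console.log(lintQuestion("Did you find the export flow frustrating?"));       // flags the closed stem
console.log(lintQuestion("Walk me through the last time you exported a report.")); // passes: []
```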
03 · Source the right participants
Five participants is the rough floor for voice interviews. That's not a strict mapping of Jakob Nielsen's "five users" rule (which was about usability testing, not interview research), but the intuition holds: below five, you can't tell signal from noise. Above twelve, you'll usually hit thematic saturation for a single homogeneous group, a finding Guest, Bunce and Johnson confirmed empirically in 2006: "basic elements for metathemes were present as early as six interviews."
Two practical sourcing notes. First, recruit for the actual decision you're making. Customers who churned tell you something different from prospects who never signed up. Second, don't pay so much that you attract professional respondents. Modest incentives, friction in the consent screen, and a clear "this is for product research" framing all filter for the right kind of participant.
If you're running a single small study to validate an early hunch, ten responses on the Talkful free plan is usually enough to find the next question to ask.
04 · Make the recording feel like a conversation
This is where async voice diverges from a 1:1 interview. The participant is alone with their phone; you're not in the room to nod, reframe, or follow up. Three things make the difference between a thin response and a real one:
- A short, human intro. A few sentences from the PM, with their name and photo, telling the participant what the study is for and how their answer will be used. In Talkful, that's the optional intro message on the consent screen, capped at 600 characters. It's not a memo.
- One question per screen. The participant should never see question 4 while answering question 2. Cognitive crowding tanks response quality.
- A typing fallback that doesn't feel like a punishment. Some people can't speak comfortably (open office, sleeping baby, language they're not confident in). Offering "prefer to type?" as a peer choice rather than a hidden link costs nothing and rescues responses you'd otherwise lose.
A note on length: 90 to 120 seconds per question is the sweet spot. Above 180 seconds, completion drops. Below 60 seconds, participants don't have time to settle into a story before the timer pressure kicks in. Talkful's default is 120 seconds, configurable per question between 15 and 300.
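If you keep your study definitions in code or config, those constraints reduce to a tiny schema. The field names below (maxRecordingSeconds, allowTypedFallback) are illustrative, not Talkful's actual API; the clamp just mirrors the 15-to-300-second window described above.

```typescript
// Illustrative study config. Field names are assumptions, not a real API.
interface StudyQuestion {
  prompt: string;                // one question per screen
  maxRecordingSeconds: number;   // 15 to 300; 120 is the sensible default
  allowTypedFallback: boolean;   // "prefer to type?" offered as a peer choice, not a hidden link
}

const DEFAULT_LIMIT_SECONDS = 120;

// Clamp whatever a PM types into the 15-300 second window.
function question(prompt: string, seconds: number = DEFAULT_LIMIT_SECONDS): StudyQuestion {
  return {
    prompt,
    maxRecordingSeconds: Math.min(300, Math.max(15, seconds)),
    allowTypedFallback: true,
  };
}

const study: StudyQuestion[] = [
  question("Walk me through the last time you invited a teammate.", 180), // one flagship question gets more room
  question("What almost stopped you from finishing onboarding?"),         // everything else stays at the default
];
```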
05 · Listen before you analyze
This is the step everyone skips, and it's the only one that's non-negotiable.
Before you open the synthesis tab, listen to the first three to five responses end to end, with no notes. You're recalibrating your sense of what your participant population actually sounds like: tone, pace, the words they use for your product (which are almost never the words you use for your product). After three responses you'll already notice patterns the LLM won't surface, because it doesn't know what to weight.
This is also the step where AI changes the work, not removes it. As NN/g put it in their 2025 outlook, AI systems are starting to summarize and even conduct studies, but the judgment about what matters is still yours.
06 · Synthesize without flattening
Once you've listened, then run the synthesis. In Talkful this needs at least five completed responses; below that, themes are too thin to trust. The output is a written summary, a sentiment distribution across the responses, three to four theme clusters with representative quotes, plus consensus, divergent, and surprise sections.
The trap is to read the synthesis and stop. Synthesis flattens variance by design: it's looking for what's common. The interesting product decisions usually live in the divergent and surprise sections, and inside the individual response transcripts (where each LLM-extracted quote is timestamped against the original audio so you can play exactly that moment back).
A simple rule: for every theme the synthesis surfaces, open at least one underlying transcript and play the clip. If the audio doesn't match the summary's tone, trust the audio.
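If you pipe the synthesis into your own tooling (a dashboard, a weekly digest), it helps to treat it as a typed object rather than a blob of prose. The shape below is illustrative rather than Talkful's actual export format, but it covers the parts described above, including the per-quote timestamps that let you jump straight to the audio.

```typescript
// Illustrative synthesis shape. Field names are assumptions, not an actual export format.
interface TimestampedQuote {
  responseId: string;
  text: string;          // the LLM-extracted quote
  startSeconds: number;  // offset into the original audio, so you can replay exactly that moment
}

interface ThemeCluster {
  label: string;
  summary: string;
  quotes: TimestampedQuote[];
}

interface StudySynthesis {
  summary: string;                   // the written overview
  sentiment: { positive: number; neutral: number; negative: number };
  themes: ThemeCluster[];            // typically three to four clusters
  consensus: string[];               // what nearly everyone said
  divergent: string[];               // where participants split
  surprises: string[];               // what nobody on the team predicted
}

// The rule above, as code: for every theme, queue at least one clip to actually listen to.
function clipsToReview(synthesis: StudySynthesis): TimestampedQuote[] {
  return synthesis.themes
    .filter((theme) => theme.quotes.length > 0)
    .map((theme) => theme.quotes[0]);
}
```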
07 · Decide what to do with what you heard
A study isn't done when the synthesis is generated. It's done when someone changes a roadmap, a piece of copy, or a meeting agenda because of it.
Two anti-patterns to avoid here. First, don't write the executive summary in the LLM's voice. Write it in your own, with quotes (text or audio) as evidence. Second, don't bury the bad news. The findings that hurt are usually the ones worth shipping.
If you find yourself synthesizing the same kind of finding three studies in a row, the next study should be different: a different participant population, a different question, or a different decision you're trying to inform.
Common mistakes (and how to dodge them)
These are the four most common mistakes we see in studies PMs run for the first time:
- Twelve questions instead of five. Cut ruthlessly. Fewer, deeper.
- Leading question stems. "Because" and "even though" almost always smuggle assumptions in.
- Not listening to raw audio. The synthesis is a starting point, not an answer.
- Treating sample size as a research goal. Fifty responses to a bad question are worth less than five responses to a good one.
FAQ
How many participants do I need for a voice user interview?
Five is the practical floor; eight to twelve is comfortable for a single persona. Above twelve, you'll see thematic saturation, where new responses largely confirm patterns you've already heard. If you're studying multiple personas, treat each one as its own count.
How long should each question allow for recording?
90 to 120 seconds. Long enough to settle into a story, short enough that the participant doesn't ramble or lose interest. Allow up to 180 seconds for one or two flagship questions; cap the rest tighter. Talkful's default is 120 seconds per question, configurable between 15 and 300.
Can I run voice interviews if my participants speak different languages?
Yes, with a caveat. Modern transcription (we use Deepgram Nova-3) auto-detects across 50+ languages, so participants can answer in whichever language they think in. Synthesis works best when the dataset shares a language, though. For genuinely cross-language studies, expect to read responses in their original language and to do the cross-language synthesis yourself.
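If you're wiring transcription up yourself rather than using Talkful, the call for one multilingual response is small. The sketch below assumes Deepgram's prerecorded /v1/listen REST endpoint; treat the query parameters as assumptions to verify against the current docs, since Nova-3 also exposes a multilingual mode (language=multi) that may suit code-switching answers better than detect_language.

```typescript
import { readFile } from "node:fs/promises";

// Minimal sketch: send one recorded answer to Deepgram and get the transcript back.
// Query parameters are assumptions; verify them against the current Deepgram docs.
async function transcribe(audioPath: string, apiKey: string): Promise<string> {
  const audio = await readFile(audioPath);
  const params = new URLSearchParams({
    model: "nova-3",
    smart_format: "true",
    detect_language: "true", // let the model pick the participant's language
  });

  const res = await fetch(`https://api.deepgram.com/v1/listen?${params}`, {
    method: "POST",
    headers: {
      Authorization: `Token ${apiKey}`,
      "Content-Type": "audio/wav",
    },
    body: audio,
  });
  if (!res.ok) throw new Error(`Deepgram returned ${res.status}`);

  const data = await res.json();
  // Prerecorded responses nest the transcript under results.channels[0].alternatives[0].
  return data.results.channels[0].alternatives[0].transcript;
}
```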
How is voice user research different from a traditional user interview?
A traditional user interview is synchronous, scheduled, and 1:1. Voice user research is async: participants record on their own time, on their own phone, in their own voice. You lose the live, multi-turn follow-up. You gain reach, candor, and a much higher response rate. Talkful narrows the gap with a smart follow-up: after a participant submits a voice or rating answer, an LLM decides whether one clarifying question would sharpen it and shows the probe as a separate full-screen step. It is one probe, async, optional, never a live conversation. The two methods still complement each other. Use async voice (with a smart follow-up where it earns its keep) to figure out what to ask, then a 1:1 interview to chase the threads that need several turns.
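The decision behind that probe is a small pattern you can reproduce in your own stack: give an LLM the original question and the transcript, and ask it to return either exactly one clarifying question or nothing. The sketch below uses the OpenAI Node SDK to show the shape; the model name and prompt wording are assumptions, and this is not Talkful's implementation.

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Decide whether one clarifying probe would sharpen a voice answer.
// Returns the probe text, or null when the answer already stands on its own.
async function smartFollowUp(question: string, transcript: string): Promise<string | null> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumption: any model that reliably returns JSON will do
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You review voice answers to research questions. If ONE short clarifying " +
          'question would materially sharpen the answer, return {"probe": "..."}; ' +
          'otherwise return {"probe": null}. Never ask more than one question.',
      },
      { role: "user", content: `Question: ${question}\n\nTranscript: ${transcript}` },
    ],
  });

  const parsed = JSON.parse(completion.choices[0].message.content ?? "{}");
  return typeof parsed.probe === "string" ? parsed.probe : null;
}
```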
Do participants actually record voice answers, or do they bail?
They record. The fear most PMs have ("nobody will hit the button") doesn't hold up. Voice messaging is the dominant form of async communication for billions of people: Meta reports 7 billion voice messages a day on WhatsApp alone. The skill is making the form feel like sending a voice note to a friend, not filling out a survey. The medium isn't the friction. The framing is.
Voice user research isn't a fancier survey. It's a different kind of evidence: less structured, more honest, harder to skim and easier to act on. The work isn't easier than text-based research. It's just better calibrated to what people actually mean.