If you're planning to build a medical scribe or add transcription to a telehealth or health app, the honest answer is this: the speech‑to‑text part is the easy 20%. Turning clinical audio into a note a clinician will actually sign — accurately, safely, and within HIPAA — is the other 80%. A production‑grade in‑house build is realistically a 12–24+ month effort with a specialized team, followed by permanent maintenance. This guide breaks down the real scope, timeline, team, and cost so you can decide whether to build it, use a medical speech-to-text API, or ship it under your brand through a white-label partner.
Why “just add a speech-to-text API” underestimates the build
General‑purpose speech recognition is largely a solved problem. Whisper, Deepgram, Google, and AWS will all transcribe a clean conversation well. Clinical audio is where that confidence breaks down — and it breaks down in ways that are expensive to fix.
- Vocabulary: drug names, dosages, lab values, and specialty terms are exactly the words generic models get wrong — and exactly the words that matter clinically. See medical vs. general speech-to-text for what changes.
- Two speakers, real rooms: a clinician and a patient (or a whole group) talking over each other, on imperfect microphones and unreliable telehealth connections.
- Silence and hallucination: general ASR can confidently invent text over long pauses, which is a real safety problem in a clinical note.
- And a transcript still isn't a note. The deliverable a clinician wants is a structured SOAP, DAP, BIRP, or GIRP note — not a wall of dialogue.
In other words, the API call you start with is the visible tip. The chart below is what “add transcription” actually expands into once it has to work on real visits. (If you want the pipeline architecture in depth, we cover it in how to build clinical documentation with a medical speech-to-text API.)

The work that actually consumes the timeline
Teams almost always under‑scope a medical scribe because they price the transcription and forget the rest. Here is where the months actually go:
1. Clinical accuracy
Tuning recognition for medical vocabulary, dosages, and specialty language — the difference between “15 milligrams” and “50 milligrams” is not a rounding error in healthcare.
2. Speaker diarization
Reliably separating clinician from patient (and speaker from speaker in group settings) so the note attributes statements correctly.
3. Note generation
Turning the transcript into a real SOAP, DAP, BIRP, GIRP, or custom‑template note — and extracting structured encounter data like problems, medications, and coding candidates.
4. Evaluation and QA
The hardest and most overlooked layer: how do you prove a generated note is good? You need an evaluation harness and regression tests; without them, any model or prompt change can silently degrade quality.
5. Clinician review
A human‑in‑the‑loop review‑and‑sign step inside your UI. The clinician is responsible for the final note, so the workflow has to make verification fast, not optional.
6. HIPAA, PHI, and your BAA
Encryption in transit and at rest, least‑privilege access, audit logging, a zero‑retention posture for audio, and a Business Associate Agreement with every sub‑processor in the chain.
7. Maintenance
Models drift, new drugs and codes appear, formats change, and prompt updates regress. This work never ends — more on it below.
8. EHR write-back
Getting the finished note and structured data back into the chart — often via FHIR — so it lands where the clinician already works.
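As a rough illustration of the write-back step, the sketch below wraps a signed note in a minimal FHIR R4 `DocumentReference` resource. The function name, the patient ID, and the choice of a plain-text attachment are assumptions made for the example; a real EHR will usually require more fields (document type, author, encounter) and its own authentication before it accepts the resource.

```python
import base64
import json


def build_document_reference(note_text: str, patient_id: str) -> dict:
    """Wrap a clinician-signed note in a minimal FHIR R4 DocumentReference.

    Hypothetical helper: most EHRs layer extra required fields on top of
    the base resource, so treat this as a starting shape only.
    """
    # FHIR attachments carry inline content as base64-encoded bytes.
    encoded = base64.b64encode(note_text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",      # the reference itself is active
        "docStatus": "final",     # the note has been reviewed and signed
        "subject": {"reference": f"Patient/{patient_id}"},
        "content": [
            {"attachment": {"contentType": "text/plain", "data": encoded}}
        ],
    }


if __name__ == "__main__":
    resource = build_document_reference(
        "S: Patient reports improved sleep.", "example-123"
    )
    print(json.dumps(resource, indent=2))
```

The resulting JSON would typically be sent with an HTTP POST to the EHR's FHIR endpoint; that call is left out here because endpoints, scopes, and required profiles differ from one EHR to the next.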
A realistic timeline and team
Exact numbers depend on scope and how good “good enough” has to be, but a production‑grade build in behavioral health or general medicine typically looks like this:
| Phase | Typical duration | Who you need |
|---|---|---|
| Prototype: ASR + a first note | 1–3 months | ML / app engineers |
| Clinical accuracy + diarization | 3–6 months | ASR / ML engineers, clinical SME |
| Note generation + evaluation harness | 3–6 months | LLM engineers, clinical reviewers |
| Review UI + EHR write-back | 2–4 months | Product / full-stack engineers |
| HIPAA, security, BAA chain | 2–4 months (parallel) | Security / compliance |
| Hardening to production quality | 3–6 months | The whole team |
Realistically that is 12–24+ months of calendar time and a cross‑functional team that includes specialized ASR and LLM talent plus clinical and security expertise — the hires that are hardest to find and slowest to ramp. And reaching v1 is not the finish line.
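To see where the 12–24+ month figure comes from, the phase ranges in the table can simply be added up. The sketch below assumes the non-parallel phases run back to back and that the compliance work, marked as parallel, adds no calendar time of its own; both are simplifications, and real projects overlap phases in messier ways.

```python
# (phase, min_months, max_months, runs_in_parallel), copied from the table above
PHASES = [
    ("Prototype: ASR + a first note", 1, 3, False),
    ("Clinical accuracy + diarization", 3, 6, False),
    ("Note generation + evaluation harness", 3, 6, False),
    ("Review UI + EHR write-back", 2, 4, False),
    ("HIPAA, security, BAA chain", 2, 4, True),
    ("Hardening to production quality", 3, 6, False),
]


def calendar_range(phases):
    """Sum the sequential phases; parallel phases are assumed to add no time."""
    low = sum(lo for _, lo, _, parallel in phases if not parallel)
    high = sum(hi for _, _, hi, parallel in phases if not parallel)
    return low, high


if __name__ == "__main__":
    low, high = calendar_range(PHASES)
    print(f"{low} to {high} months")  # prints "12 to 25 months"
```

Under those assumptions the total lands at 12 to 25 months, which matches the range quoted above.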
The maintenance treadmill nobody budgets for
A medical scribe is not a project you finish; it is a system you keep alive. After launch you own, indefinitely:
- Model and quality drift — output that quietly gets worse as underlying models change or your data shifts.
- A moving vocabulary — new medications, codes, and specialty terms that recognition has to keep up with.
- Prompt and format regressions — every change to a note template risks breaking others, which is why the evaluation harness from day one matters.
- Security and compliance upkeep — access reviews, audits, incident response, and keeping the BAA chain current.
- On-call for a clinical system — when documentation breaks mid-clinic, it is urgent.
This ongoing cost is the single most under‑estimated line item, and it is the strongest argument for not building unless you have to.
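The regression risk in the list above is what an evaluation harness is meant to catch. Here is a minimal sketch of the idea, assuming each test case pairs a generated note with facts a clinical reviewer has marked as required, and that scores from the last accepted release are stored as a baseline. All names are hypothetical, and verbatim substring matching is a crude stand-in for the clinical scoring a real harness would need.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    name: str
    generated_note: str
    required_facts: list[str]  # facts a reviewer marked as must-appear


def fact_recall(case: EvalCase) -> float:
    """Fraction of required facts found verbatim (case-insensitive) in the note."""
    if not case.required_facts:
        return 1.0
    note = case.generated_note.lower()
    hits = sum(1 for fact in case.required_facts if fact.lower() in note)
    return hits / len(case.required_facts)


def find_regressions(cases, baseline, tolerance=0.0):
    """Return names of cases scoring below their stored baseline score."""
    return [
        case.name
        for case in cases
        if fact_recall(case) < baseline.get(case.name, 0.0) - tolerance
    ]


if __name__ == "__main__":
    cases = [
        EvalCase(
            name="visit-1",
            # A prompt change turned "50 mg" into "15 mg" in the output.
            generated_note="Plan: continue sertraline 15 mg daily. Follow up in 4 weeks.",
            required_facts=["sertraline 50 mg", "follow up in 4 weeks"],
        )
    ]
    baseline = {"visit-1": 1.0}  # score recorded for the last accepted release
    print(find_regressions(cases, baseline))  # prints ['visit-1']
```

Run on every model or prompt change, a check like this turns a silent quality drop into a failing test, which is the point of building the harness early.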
Build, API, or partner: a quick decision
There are three honest paths to adding medical transcription, and the right one depends on whether voice AI is your product or a feature of it.

- Build in-house — right only when clinical voice AI is your core differentiator and you can fund a multi-year team. See our build vs. partner breakdown for EHRs for that decision in detail.
- Speech-to-text API — you keep full control of the UX and wire in a medical speech-to-text API that already handles medical-grade recognition, diarization, note formats, and structured extraction. Ships in days to weeks, with a BAA available.
- White-label partner — the fastest path when documentation is a feature your users expect, not your product. A partner program lets you embed the whole experience under your own brand.
How Twofold fits
Twofold was built so you don't have to spend 12–24+ months building medical voice AI from scratch. Two paths cover most builders:
- The medical speech-to-text API turns clinical audio into transcripts, finished notes (SOAP, DAP, BIRP, GIRP, and custom), and structured EHR-ready data through a single call — with speaker diarization and medical-tuned recognition built in. Best when you want to own the UX.
- The partner program offers referral, reseller, co-branded, white-labeled/embedded, and custom-integration models — EHR-agnostic, via API and webhooks. Best when you want the documentation experience inside your product as if you built it.
Both are HIPAA‑conscious with a Business Associate Agreement available, no training on customer data, and no audio retention — so you inherit the compliance posture instead of building and defending it yourself.
The bottom line
Adding medical transcription to your app is mostly not a transcription problem. The accuracy, diarization, note generation, evaluation, review, and compliance work behind a signable clinical note is where the time and money go — and it never fully stops. Build it only if clinical voice AI is the product you're in business to ship. If it's a feature your users expect, a speech-to-text API or a white-label partner gets you there in weeks, not years.

