Why a model's native audio isn't enough
A general-purpose model can listen to a short clip. It can't ingest a two-hour recording. OpenAI's speech-to-text API caps file uploads at 25 MB, and an hour of decent-quality audio clears that ceiling on its own. So an agent that needs to transcribe real recordings can't just hand the file to the model. It needs a dedicated transcription step. The consumer version of this question, whether a chat model can do it at all, is covered in 'Can ChatGPT transcribe audio'; this is the developer version.
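A quick back-of-the-envelope check shows how fast the cap arrives. This is a rough sketch: it assumes constant-bitrate audio, ignores container overhead, reads the 25 MB cap as 25 × 1024 × 1024 bytes, and uses 128 kbps as an illustrative bitrate rather than a figure from any API documentation.

```python
UPLOAD_CAP_BYTES = 25 * 1024 * 1024  # 25 MB cap, read as MiB (assumption)


def audio_size_bytes(duration_s: float, bitrate_kbps: float) -> float:
    """Approximate size of constant-bitrate audio, ignoring container overhead."""
    return duration_s * bitrate_kbps * 1000 / 8


def max_duration_s(bitrate_kbps: float, cap_bytes: int = UPLOAD_CAP_BYTES) -> float:
    """Longest recording, in seconds, that fits under the cap at this bitrate."""
    return cap_bytes * 8 / (bitrate_kbps * 1000)


one_hour = audio_size_bytes(3600, 128)
print(f"{one_hour / 1e6:.1f} MB")             # 57.6 MB
print(f"{max_duration_s(128) / 60:.1f} min")  # 27.3 min
```

At that bitrate an hour is more than double the cap, and the file stops fitting a little under half an hour in.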
Size isn't the only limit. Native transcription returns a flat block of text. There are no speaker labels by default, and no timestamps unless you ask. An agent that gets back one undifferentiated string can summarize it, but it can't tell you who spoke or when. For anything past a quick clip, plain model audio leaves the agent guessing.
The job, then, is to give the agent a tool that swallows long files and returns structured output. That's what a transcription MCP server does. It sits between the agent and a speech-to-text backend, takes the audio, and hands back labeled, timed segments the agent can actually reason over.
How a transcription MCP server exposes the tool
MCP is an open protocol that Anthropic open-sourced in November 2024 for connecting AI applications to external tools and data. It uses JSON-RPC 2.0 messages between hosts, clients, and servers. A transcription server plays the server role: it exposes a tool the agent can call, like transcribe(audio_url).
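On the wire, a tool call is an ordinary JSON-RPC 2.0 request with the method tools/call. The sketch below builds one for the transcribe tool; the tool name, the audio_url argument, and the example URL are illustrative, not a published schema.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and argument, matching the transcribe(audio_url) example.
print(make_tool_call(1, "transcribe", {"audio_url": "https://example.com/meeting.mp3"}))
```

In practice an MCP client library builds and sends this message for you; seeing the raw shape mostly helps when you are debugging a server.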
Servers offer three things to clients: resources, prompts, and tools. Tools are the functions a model can execute, so your transcription capability registers as one. MCP defines exactly two standard transports, stdio and Streamable HTTP. Use stdio for a local dev tool and Streamable HTTP for a hosted service other clients reach over the network.
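A server advertises each tool with a name, a description, and a JSON Schema for its inputs. Here is a minimal sketch of what a transcription tool's descriptor might look like. The description text and the audio_url field are assumptions for illustration, and missing_arguments is a hypothetical helper, not part of any SDK.

```python
# Hypothetical descriptor for a transcription tool, in the shape a server
# returns when a client lists its tools.
TRANSCRIBE_TOOL = {
    "name": "transcribe",
    "description": "Submit an audio URL for transcription and return a job id.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "audio_url": {
                "type": "string",
                "description": "Publicly reachable URL of the audio file.",
            },
        },
        "required": ["audio_url"],
    },
}


def missing_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return the required argument names the caller did not supply."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


print(missing_arguments(TRANSCRIBE_TOOL, {}))  # ['audio_url']
```

The description does real work: it is what the model reads when deciding whether to call the tool, so say plainly that the call returns a job id rather than a finished transcript.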
Because MCP is an open standard, the design intent is that any conforming client can call any conforming server. Claude, an IDE agent, or your own host can use the same transcription server without custom glue. There's now an official MCP registry, announced in September 2025, for publishing and discovering servers.
The async job pattern: submit, then poll or webhook
Transcribing an hour of audio takes longer than a single request should stay open. The established fix is to accept the job and answer immediately. HTTP 202 Accepted, defined in RFC 9110 section 15.3.3, means the request has been accepted for processing but the processing has not been completed. Google's AIP-151 formalizes the same idea: a long-running method returns an Operation object that the client uses to check progress and, eventually, fetch the result.
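The accept-now, finish-later idea can be sketched without any web framework. The handler below records the job and returns 202 straight away. The in-memory JOBS store, the /jobs/{id} path, and the status names are all assumptions for illustration, and pointing at the status resource with a Location header is a common convention rather than a requirement.

```python
import uuid

# In-memory job store for illustration; a real service would persist jobs.
JOBS: dict[str, dict] = {}


def submit_transcription(audio_url: str) -> tuple[int, dict, dict]:
    """Accept a job and return (status, headers, body) without doing the work."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "audio_url": audio_url}
    headers = {"Location": f"/jobs/{job_id}"}  # hypothetical status path
    body = {"job_id": job_id, "status": "queued"}
    return 202, headers, body


def get_job(job_id: str) -> tuple[int, dict]:
    """Report the current status of a job, or 404 if the id is unknown."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error": "unknown job"}
    return 200, {"job_id": job_id, "status": job["status"]}


status, headers, body = submit_transcription("https://example.com/meeting.mp3")
print(status, headers["Location"])
```

A worker process would pick the job up from the store and update its status; the request handler never waits for it.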
In practice the tool call runs in three beats. The agent submits the audio and gets back a job_id right away. Then it either polls a status endpoint until the job hits a terminal state, or registers a webhook that fires when the transcript is ready. Webhooks save you the polling loop; polling is simpler to stand up. Pepys ships this exact submit-then-poll flow as both a REST API and an MCP server; the endpoints, webhooks, and MCP setup are in the developer reference.
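The polling half of that flow might look like the sketch below. It takes the status lookup as a callable so it is not tied to any particular API. The state names completed and failed, the backoff numbers, and the one-hour timeout are assumptions to adjust for your backend.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}  # assumed state names


def poll_until_done(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    max_interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status(job_id) with exponential backoff until a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_s}s")
        sleep(interval_s)
        # Double the wait each round, capped so long jobs are still checked regularly.
        interval_s = min(interval_s * 2, max_interval_s)
```

Backoff keeps a long job from hammering the status endpoint, and the deadline guarantees the loop ends even if the job never reaches a terminal state.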
Design the tool so the agent is never blocked. Return the job_id on the first call so the agent can keep working, then resolve when the transcript lands. A tool that holds the line for ten minutes will trip the client's timeout long before your transcript is done.
Why an agent needs diarized, timestamped output
Plain text can't answer 'who said what.' Speaker labeling, or diarization, is a separate capability. OpenAI ships it as a distinct model, and you request the diarized_json format to get segments tagged with speaker, start, and end. Return that structure, and the agent can attribute every line. What diarization is, and where it struggles, is covered in 'What is speaker diarization'.
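Once segments carry speaker, start, and end, turning them into attributable lines is a few lines of code. This sketch assumes a payload with a top-level segments list whose items also have a text field; that exact shape, and the sample data, are illustrative, so check your backend's response before relying on it.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the start of the audio
    end: float
    text: str


def parse_segments(payload: dict) -> list[Segment]:
    """Convert a diarized payload (assumed shape) into typed segments."""
    return [
        Segment(s["speaker"], float(s["start"]), float(s["end"]), s["text"].strip())
        for s in payload["segments"]
    ]


def clock(seconds: float) -> str:
    """Format an offset in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> list[str]:
    """Produce one '[MM:SS] speaker: text' line per segment."""
    return [f"[{clock(s.start)}] {s.speaker}: {s.text}" for s in segments]


sample = {"segments": [
    {"speaker": "A", "start": 872.4, "end": 875.0, "text": " Revenue grew nine percent."},
]}
print(render(parse_segments(sample)))  # ['[14:32] A: Revenue grew nine percent.']
```

Diarization usually yields generic labels rather than names, so mapping a label like A to a real person is a separate step the agent or the user has to supply.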
Timestamps are what make an agent's answer checkable. OpenAI exposes exactly two timestamp granularities, segment and word, both opt-in. Word-level timing lets an agent cite the exact second a quote appears, not just the rough passage. If you want the agent to link back to the audio, request word-level timestamps and pass them through to the agent unchanged. See 'What a timestamped transcript is' for why the offsets matter.
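With word-level timing in hand, locating a quote is a sliding-window match over the word list. The sketch below assumes each word is a dict with word, start, and end keys, and matches case-insensitively after stripping punctuation. It is a simple exact match, not a fuzzy search, so a misheard word will cause a miss.

```python
import re
from typing import Optional


def _norm(token: str) -> str:
    """Lowercase a token and drop punctuation other than apostrophes."""
    return re.sub(r"[^\w']+", "", token).lower()


def find_quote(words: list[dict], quote: str) -> Optional[tuple[float, float]]:
    """Return (start, end) in seconds for the first occurrence of quote, or None."""
    target = [t for t in (_norm(part) for part in quote.split()) if t]
    tokens = [_norm(w["word"]) for w in words]
    n = len(target)
    if n == 0:
        return None
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            return words[i]["start"], words[i + n - 1]["end"]
    return None


words = [
    {"word": "We", "start": 0.0, "end": 0.2},
    {"word": "missed", "start": 0.2, "end": 0.6},
    {"word": "the", "start": 0.6, "end": 0.7},
    {"word": "target.", "start": 0.7, "end": 1.1},
]
print(find_quote(words, "missed the target"))  # (0.2, 1.1)
```

The returned offsets are what the agent would attach to a citation or use to build a deep link into the recording.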
The structure is the difference between an agent that summarizes and one that can be fact-checked. 'The CFO said X at 14:32' is a claim someone can verify against the recording. A flat paragraph with no speakers and no times is not. Give the agent diarized, timed segments and its answers become auditable by default.
Consent and retention for an agent that records
An agent that captures or ingests conversations inherits the same consent rules a person would. US federal law requires the consent of at least one party, and about 11 US states require all parties to consent before recording. Rules also differ by country. Build consent capture into the flow rather than assuming it.
Retention is the other half. An agent pipeline shouldn't hoard raw audio it no longer needs. Pepys never trains on your audio or text, and source media auto-deletes after processing rather than sitting on a server indefinitely. For sensitive material, delete the audio once the transcript is written and keep only what the agent actually uses.
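Ordering matters when you delete: write the transcript first, confirm it landed, then remove the audio. Here is a minimal sketch of that sequence for local files. The function name and paths are hypothetical, and a hosted pipeline would do the equivalent against its object store.

```python
import json
import os
import tempfile
from pathlib import Path


def finalize_transcript(audio_path: Path, transcript: dict, out_path: Path) -> None:
    """Persist the transcript, then remove the source audio."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so the final rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(transcript, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, out_path)
    # Reached only if the transcript was written successfully.
    audio_path.unlink(missing_ok=True)
```

If the write fails, the exception propagates before the unlink, so the audio is still there to retry. Note that unlink removes the file from the filesystem; it is not a secure wipe of the underlying storage.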
Treat the transcript as the artifact, not the recording. Once you have diarized, timestamped text, the agent rarely needs the audio again. Keeping less is both a privacy posture and one fewer thing to secure.