AI Podcast Workflow: How to Make an AI Podcast (2026)

AI hosts, voice clones, multi-track production, distribution. The studio podcast workflow, in development now. Founder pricing locked for early signups.

Mike Zapata · Last updated May 19, 2026 · 29 min read

KEY TAKEAWAYS

An AI podcast workflow stacks voice cloning, AI-assisted scripting, and audio editing into a 3-6 hour episode pipeline.
ElevenLabs Professional Voice Clone is the current quality leader for long-form spoken content, ahead of Resemble, Play.ht, and open-source alternatives.
Per-episode tooling cost runs 25 to 200 dollars for an AI podcast. All-in cost, including labor, runs 200 to 500 dollars versus 800 to 3,000 for traditional production with a freelance producer.
Spotify, Apple, and YouTube allow AI-generated audio with proper disclosure; failing to disclose risks platform takedown and FTC exposure.
The voice clone is not the moat. Script quality, sound design, and a real point of view are what separate AI podcasts that get listened to from ones that get skipped.

An AI podcast workflow is the end-to-end production system for generating a podcast episode using AI voice cloning, AI-assisted scripting, and AI-aware editing tools. The modern version assembles ElevenLabs for voice synthesis, a writing tool like Claude or GPT for scripting, and either an all-in-one platform like Wondercraft or a stitched stack of Descript and Riverside.fm for editing. A working solo workflow gets you from idea to published episode in three to six hours, at a tooling cost of 25 to 200 dollars per episode, with quality that holds up against mid-tier traditional shows when the script is good.

What is an AI podcast
Format options for AI podcasts
The cost comparison: AI podcast vs traditional
Voice cloning with ElevenLabs: full workflow
Multi-voice production for interview and dialogue shows
Script writing for AI hosts
The end-to-end production workflow
Audio editing and post-production
Music, SFX, and mix decisions
Distribution: Spotify, Apple, YouTube
Monetization paths for AI podcasts
Disclosure requirements and legal exposure
Tool comparison: Wondercraft, Jellypod, Riverside, Descript
Common pitfalls that kill AI podcasts
Frequently asked questions

Caption: The four-stage AI podcast pipeline: source recording, voice training, episode generation, and platform distribution.

What is an AI podcast

An AI podcast is an audio show whose hosts, voices, or narration are produced using generative AI, typically a voice cloning model trained on a real speaker or a stock synthetic voice. The script can be human-written, AI-drafted with human editing, or generated entirely from a prompt. The defining property is that the voice you hear was synthesized, not recorded live. The category covers everything from solo-host explainers using a creator's own cloned voice to fully synthetic news briefings with no human in the recording booth at all.

The format is not new. NotebookLM's Audio Overviews popularized discussion-style AI podcast audio in late 2024. What changed in 2025 and 2026 is that production tools matured to the point where voice quality became tolerable for long-form listening. ElevenLabs v3, released in mid-2025, was the inflection point for prosody and emotional range. Before v3, AI narration sounded like a smart speaker reading a Wikipedia article. After v3, it sounds like a competent voiceover artist on a flat day. That is the floor that makes the rest of the workflow viable.

The studio uses AI voice not for Ava Moreno's feed, which is visual, but for the operator account's adjacent audio experiments and for client work where founders want a podcast voice without recording themselves. The same toolchain that produces a cloned voice for Mike Zapata produces the voiceover that runs under a brand client's product video.

Format options for AI podcasts

The format you choose determines the difficulty of the production. Some formats survive AI narration easily. Others struggle.

Narrated explainer or essay. A single voice reading a tightly written script. This is the easiest format to produce with AI because there is no conversation, no interruption, and no live dynamic. NotebookLM's Audio Overviews and most AI-native podcasts on Spotify use a variant of this. Episode length typically 8 to 25 minutes.

Two-host conversation. Two voices in dialogue. Significantly harder because pacing, interruption, and reaction beats matter. Done well, this is the format that fooled most listeners during NotebookLM's launch. Done badly, it sounds like two text-to-speech engines reading email back and forth.

Interview. A cloned host voice asking questions to a real recorded human guest. This is the format that actually scales for founders and operators. The host clone reads questions, the guest answers in real audio, and the result is closer to a traditional interview show than a fully synthetic one.

News briefing. Short daily or weekly digest of a category, typically narrated. Jellypod is purpose-built for this. The format tolerates AI well because listeners want efficiency, not personality.

Educational lecture. Long-form single voice teaching a topic. Works when the script is dense and the voice clone is trained well. Falls apart when stability is set too high and the delivery turns monotone.

Branded show. A company commissions a fully scripted, voice-cloned host to deliver content tied to their product. This is a real market and the one with the cleanest commercial path. Pricing for done-for-you AI branded podcast production currently runs 1,500 to 5,000 dollars per episode in the agency tier.

The decision is not about which format is best. It is about which format your script writing capability can sustain. AI voice is now competent. AI scripts are not yet competent at the level a good human writer produces. The format that wins is the one where you can write tightly enough to carry the voice.

The cost comparison: AI podcast vs traditional

The cost difference is the headline most people lead with. The reality is more textured than the headlines suggest.

Cost line (dollars)	AI podcast (solo, 30 min episode)	Traditional podcast (interview, 45 min)
Host time	2-4 hours scripting + review	2-3 hours prep + recording
Voice / talent	8-25 in ElevenLabs credits	0 (your own voice) or 200-800 per episode for hired host
Editor	0-50 in Descript subscription share	200-600 per episode freelance
Sound design / music	10-30 in stock or AI music credits	50-200 per episode
Studio / equipment	0 (already owned)	0-300 for studio rental
Hosting platform	5-15 per month (Transistor, Captivate, Buzzsprout)	5-15 per month (same)
Per-episode tooling cost	25 to 200 dollars	450 to 1,900 dollars
Per-episode all-in (incl. labor)	200 to 500 dollars	800 to 3,000 dollars

The savings are real. They are also misleading if you stop reading at the per-episode tooling line. The labor cost is the dominant cost in both columns. AI compresses the tooling line by 70 to 90 percent. It compresses the labor line by maybe 30 percent, mostly by removing recording and basic editing. Script writing, fact checking, story structure, and review still take real time.

The other thing the table does not capture: an AI podcast that gets ignored costs more than a traditional podcast that gets listened to. The cost line that matters for a working show is cost per listener-hour, not cost per episode. Cheap production that produces a show no one finishes is a worse outcome than expensive production that produces a show people return to.
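The cost-per-listener-hour idea can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the audience sizes, and the completion rates are hypothetical inputs, not figures from the studio.

```python
def cost_per_listener_hour(episode_cost, listeners, episode_minutes, avg_completion):
    """Dollars spent per hour of audio that was actually listened to.

    avg_completion is the average fraction of the episode each listener hears.
    """
    listener_hours = listeners * (episode_minutes / 60) * avg_completion
    if listener_hours <= 0:
        raise ValueError("no listening recorded")
    return episode_cost / listener_hours


# Hypothetical shows: a cheap AI episode few people finish,
# and an expensive traditional episode with a loyal audience.
cheap_ignored = cost_per_listener_hour(
    350, listeners=200, episode_minutes=30, avg_completion=0.2
)
pricey_loved = cost_per_listener_hour(
    2000, listeners=5000, episode_minutes=45, avg_completion=0.8
)

print(f"cheap but ignored: {cheap_ignored:.2f} dollars per listener-hour")
print(f"pricey but loved:  {pricey_loved:.2f} dollars per listener-hour")
```

With these made-up inputs the cheaper episode is far more expensive per hour of attention, which is the point of the paragraph above.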

"The cost of AI media is not zero. The cost of AI media that no one wants to consume is still the entire production cost, plus the opportunity cost of the time you spent." - Operating principle, CinematicDirector.ai studio notes

Voice cloning with ElevenLabs: full workflow

ElevenLabs is the load-bearing tool in this pillar. Voice quality is the difference between a podcast that listeners tolerate and one they finish.

Step 1: Record the source corpus

Record 30 minutes of clean, varied speech. Use a dynamic mic in a treated or naturally dead-sounding room. A Shure SM7B with a Cloudlifter into a Focusrite interface is the standard semi-pro setup and produces source quality high enough for Professional Voice Clone. A USB dynamic mic like the Shure MV7 also works and is forgiving of room acoustics.

Vary the content. Read declarative sentences, ask questions, deliver an emotional passage, list proper nouns, count slowly. ElevenLabs trains on prosody patterns, and a corpus that only contains calm reading produces a clone that can only do calm reading.

Step 2: Clean the audio

Strip room tone, mouth clicks, and background noise. iZotope RX is the professional choice; Audacity's noise reduction works for hobbyist budgets. ElevenLabs trains on exactly what you feed it. A 30-minute file with audible HVAC hum in the background will produce a clone that whispers HVAC hum under everything it ever says.

Step 3: Upload to Professional Voice Clone

Use Professional Voice Clone, not Instant Voice Clone. Instant runs on 1 to 3 minutes of audio and produces a recognizable but limited model. Professional runs on the full corpus and produces a model with much better prosody, emotional range, and long-form stability. Professional Voice Clone requires a Creator or Pro tier subscription; Instant is available on lower tiers.

Training takes 2 to 4 hours. The system ingests the corpus, builds the voice model, and surfaces it in your voice library when ready.

Step 4: Test against a hard script

Generate three test passages. A calm monologue. An emotional moment, written with deliberate intensity. A list of proper nouns including foreign names and technical terms. These three together expose every weakness a voice clone has. Plastic artifacts show up in the calm passage. Breath placement errors show up in the emotional passage. Pronunciation errors show up in the proper noun list.

If the test passages sound wrong, the problem is almost always the source corpus, not the model. Re-record with better acoustics and more variety before iterating on slider settings.

Step 5: Tune the stability and similarity sliders

ElevenLabs exposes two sliders that determine how the voice model generates new audio. Stability controls how consistent the delivery is across generations. Similarity controls how closely the model adheres to the source voice characteristics.

Stability between 0.40 and 0.55 keeps emotional variation without sounding erratic. Higher stability flattens delivery to monotone, which is what most amateur AI podcasts sound like. Lower stability produces unpredictable swings that work for character voices but destroy long-form coherence.

Similarity between 0.75 and 0.85 keeps the voice recognizably yours without overfitting to recording artifacts in the source. Pushing similarity above 0.90 starts to reproduce mouth sounds and breath patterns from the original corpus that you probably do not want in every sentence.
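A small preset object makes these ranges hard to violate by accident. This is a minimal sketch: the `VoicePreset` class is hypothetical, and the `stability` and `similarity_boost` key names are an assumption about how the API's voice settings object is shaped, so check them against the current ElevenLabs documentation.

```python
from dataclasses import dataclass

# Ranges taken from the recommendations above.
STABILITY_RANGE = (0.40, 0.55)
SIMILARITY_RANGE = (0.75, 0.85)


@dataclass(frozen=True)
class VoicePreset:
    stability: float
    similarity: float

    def warnings(self):
        """Return notes for settings outside the suggested ranges."""
        notes = []
        if self.stability > STABILITY_RANGE[1]:
            notes.append("stability high: delivery may flatten toward monotone")
        elif self.stability < STABILITY_RANGE[0]:
            notes.append("stability low: delivery may swing unpredictably")
        if self.similarity > 0.90:
            notes.append("similarity above 0.90: may reproduce mouth sounds and breaths")
        elif not SIMILARITY_RANGE[0] <= self.similarity <= SIMILARITY_RANGE[1]:
            notes.append("similarity outside the 0.75 to 0.85 range")
        return notes

    def to_voice_settings(self):
        # Key names are an assumption; verify against the API reference.
        return {"stability": self.stability, "similarity_boost": self.similarity}


preset = VoicePreset(stability=0.50, similarity=0.80)
print(preset.warnings())
print(preset.to_voice_settings())
```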

Step 6: Generate the episode and clean artifacts

Generate in chunks of 500 to 800 words. Longer generations drift in prosody as the model loses track of the overall pacing intended for the episode. Shorter generations break too many natural sentence groupings.
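One way to produce chunks in that range is to split on sentence boundaries and fill each chunk greedily up to the upper word limit. The sketch below assumes a plain-text script and a naive sentence splitter (it will mis-split abbreviations like "Dr."); `chunk_script` is a hypothetical helper, not part of any tool named here.

```python
import re


def chunk_script(script, max_words=800):
    """Split a script into chunks of whole sentences, each at most max_words.

    A single sentence longer than max_words becomes its own oversized chunk.
    The final chunk may be shorter than the 500-word lower bound.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Close the current chunk before it would overflow.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. " * 400  # 1,200 words of placeholder text
for i, chunk in enumerate(chunk_script(demo), start=1):
    print(f"chunk {i}: {len(chunk.split())} words")
```

If the last chunk comes out very short, merging it into the previous one by hand is usually simpler than tuning the splitter.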

Listen to every chunk before stitching. Re-roll any sentence that has a robotic cadence. Re-rolls are cheap; bad cadence in a published episode is expensive. Stitch in your DAW with 50ms crossfades. Apply gentle compression (3:1 ratio, -18dB threshold) and a touch of room reverb (300-500ms decay, very low mix) to break the recording-booth dryness that AI voices default to.
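A DAW handles the crossfade for you, but the operation itself is simple. The sketch below shows a linear 50ms crossfade on two lists of float samples; it assumes mono audio at 44.1kHz and a linear fade curve (many editors default to equal-power curves instead), and `crossfade` is a hypothetical helper.

```python
def crossfade(a, b, sample_rate=44100, fade_ms=50):
    """Join two sample lists, overlapping the end of a with the start of b."""
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    overlap = []
    for i in range(n):
        t = (i + 1) / (n + 1)  # fade position, strictly between 0 and 1
        overlap.append(a[len(a) - n + i] * (1 - t) + b[i] * t)
    return list(a[:len(a) - n]) + overlap + list(b[n:])


# Two 100ms chunks of constant signal: the join should stay at the same level.
chunk_a = [1.0] * 4410
chunk_b = [1.0] * 4410
joined = crossfade(chunk_a, chunk_b)
print(len(joined))  # shorter than the sum of inputs by the overlap length
```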

"The plastic feeling people associate with AI voice is usually post-production, not the model. AI voices come out dry, EQ-flat, and reverb-less. Real podcast voices live inside a room. Add the room back." - Common engineering note across professional AI podcast production setups

Multi-voice production for interview and dialogue shows

Two-host AI podcasts are harder than single-host narration. The voice quality bar is the same. The conversation quality bar is higher because listeners use interruption, reaction timing, and overlap as authenticity signals. AI-generated dialogue that sounds like two voices reading prepared statements at each other fails this signal check.

Workable approaches in 2026:

Scripted dialogue with deliberate imperfections. Write the script as a normal two-person conversation, then add reaction beats, false starts, and trail-offs. "Right, and the other thing is, well, actually, hold on." These imperfections cost nothing to add and are what real conversations contain. NotebookLM's Audio Overviews succeed largely because their underlying model was trained to insert these patterns automatically.

Real interview with cloned host. The host clone reads questions; the guest is real recorded human audio. This is the most listener-friendly format and the easiest one to sell to clients. Riverside.fm handles the human-guest side of this well, and you can drop the cloned host audio into the timeline alongside the live recording.

Single voice playing multiple characters. Not a real workaround for dialogue shows, but useful for narrative formats where one narrator voices different characters. ElevenLabs allows multiple voice models on a single account, so you can train two or three clones of yourself (perhaps with deliberately different vocal registers) and switch between them.

Avoid: two flat AI voices with no reaction beats, no interruption, no varied pace. This is the dominant failure mode for AI podcast dialogue and the reason listeners drop after one minute.

Script writing for AI hosts

Voice quality is the floor. Script quality is the ceiling. Almost every failed AI podcast loses listeners because the script is mediocre, not because the voice sounds wrong. The voice clone reads what you give it. Give it nothing interesting and the clone will deliver nothing interesting in beautiful audio.

Effective patterns for AI podcast scripts:

Lead with the strongest claim. AI podcasts do not get the benefit of the doubt that human-hosted shows do. Listeners give a synthetic host 30 to 45 seconds to justify the press of play. The opening sentence has to deliver a real claim, not a teaser. "Today we are talking about" is the death rattle of an AI podcast intro. "The interest rate cut last week was the wrong decision, and here is why" is an opening that earns the next 45 seconds.

Write for the ear, not the eye. AI voice clones read written copy literally. Sentences that work on a page often fail when spoken: too many subordinate clauses, lists longer than three items, parenthetical asides. Cut every sentence to its working length. Read every paragraph out loud during writing. If you stumble on a sentence, the clone will too.

Use named specifics. AI voices are flattest when they read generalities. "The company released a new product" is hard to deliver compellingly. "Anthropic released Claude Sonnet 4.5 on September 29, 2025" has texture the voice can lean into. Specifics give the synthesis engine emotional anchors.

Structure for completion, not depth. Listeners drop out at section transitions. Build episodes as three to five short segments rather than two long ones. Each segment opens with its own hook. This is the same principle that drives YouTube retention curves and it transfers cleanly to audio.

Cut motivational filler. Anything that sounds like a self-help podcast intro flattens an AI voice instantly. No "in today's episode we are going to unlock the secrets of". No "without further ado". No "let's dive in". Treat every word as if you are paying ElevenLabs by the character, because you literally are.
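Some of these patterns can be checked mechanically before a human edit. The sketch below flags the filler phrases quoted above and any sentence over a length threshold. The 25-word threshold and the `lint_script` helper are assumptions for illustration, not studio rules.

```python
import re

# Phrases quoted in the patterns above.
FILLER_PHRASES = [
    "today we are talking about",
    "without further ado",
    "let's dive in",
    "unlock the secrets",
]
MAX_SENTENCE_WORDS = 25  # hypothetical threshold; tune by reading aloud


def lint_script(script):
    """Return a list of human-readable issues found in the script."""
    issues = []
    lowered = script.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for phrase in FILLER_PHRASES:
        if phrase in lowered:
            issues.append(f"filler phrase: {phrase!r}")
    for sentence in re.split(r"(?<=[.!?])\s+", script.strip()):
        words = len(sentence.split())
        if words > MAX_SENTENCE_WORDS:
            issues.append(f"long sentence ({words} words): {sentence[:40]}...")
    return issues


for issue in lint_script("Without further ado, let's dive in."):
    print(issue)
```

A linter like this catches the obvious tells. It does not replace reading every paragraph out loud.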

The studio drafts most scripts with Claude using a brand voice prompt that mirrors the lowercase observational register from Ava's feed. AI assists writing; AI does not finish writing. Human edit on every script removes filler, sharpens claims, and inserts the breath beats that make voice synthesis sound natural.

The end-to-end production workflow

This is the studio's reference workflow for producing a 25-minute narrated AI podcast episode end to end. Time estimates assume the voice clone is already trained.

Hour 1: Script draft

Brief the LLM with the topic, the desired claim, and the brand voice spec
Generate a 2,500 to 3,000 word draft
Read once for structural problems (weak opening, missing claim, soggy middle)
Re-prompt for specific fixes

Hour 2: Script polish

Cut every word that does not earn its space
Read aloud, flag sentences that stumble
Tighten to 2,200 to 2,500 words for a 25-minute episode at typical AI narration pace
Insert pause markers [pause] and emphasis markers where ElevenLabs supports them
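The word targets above imply a delivery pace, and it is worth computing it rather than guessing. The sketch below derives the pace implied by 2,200 to 2,500 words over 25 minutes, then shows how episode length shifts at a different pace. The 140 words-per-minute figure is a hypothetical measurement; time one of your own generated chunks to get a real number.

```python
def implied_wpm(words, minutes):
    """Average words per minute across the whole episode."""
    return words / minutes


def estimated_minutes(words, wpm):
    """Episode length for a given word count and measured pace."""
    return words / wpm


print(implied_wpm(2200, 25))  # 88.0
print(implied_wpm(2500, 25))  # 100.0

# Hypothetical: your clone measures 140 wpm on a test chunk.
measured_wpm = 140
print(round(estimated_minutes(2500, measured_wpm), 1))
```

The implied 88 to 100 words per minute is an average over the full runtime, so it presumably absorbs pauses, music, and transitions. If your clone reads noticeably faster, the same script yields a shorter episode, and the word target should scale up accordingly.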

Hour 3: Voice generation

Generate in 500-800 word chunks via ElevenLabs API or studio UI
Listen to every chunk; re-roll any sentence that sounds off
Save each chunk as WAV at 44.1kHz

Hour 4: Edit and mix

Stitch chunks in Reaper, Logic, or Audacity
Apply compression chain (gentle compressor, light limiter)
Add subtle room tone and reverb to break the dryness
Layer intro music, outro music, and any segment transitions
Export master at -16 LUFS (the common loudness target for podcast platforms)

Hour 5 (optional): Distribution prep

Write episode show notes, including AI disclosure line
Generate or write episode title and description optimized for search
Upload to host (Transistor, Captivate, Buzzsprout, Megaphone)
Schedule release

Total: 4 to 5 hours of operator time per published episode. A second voice (for dialogue formats) typically adds 60 to 90 minutes. A research-heavy interview format that uses a real human guest adds 1 to 3 hours for interview prep and live recording but reduces the voice generation step.

Audio editing and post-production

The single highest-leverage post-production move on AI podcast audio is adding back the room. AI voice synthesis produces audio that sounds like it was recorded in an anechoic chamber. Real podcast voices live in a room. Adding that room back is what separates AI podcasts that sound passable from ones that sound uncanny.

The basic chain:

Compression: gentle 3:1 or 4:1 ratio, -18dB threshold, 10ms attack, 100ms release. The goal is to flatten dynamic range slightly, not to squash. ElevenLabs output is already fairly consistent, so heavy compression is unnecessary.

EQ: high-pass at 80Hz to remove sub-frequency rumble that has no business in spoken content. Gentle cut around 200-300Hz if the voice sounds muddy. Slight presence boost at 3-5kHz if the voice sounds buried.

Reverb: short room reverb, 300-500ms decay, mix at 5-12 percent. This is the room you are adding back. Use a sparse, dry reverb; cathedral-style reverbs sound theatrical.

De-essing: only if necessary. ElevenLabs occasionally produces sibilance artifacts on hard S sounds.

Limiter: final brick wall at -1 dBFS to prevent clipping on platforms that re-encode.

Loudness target: -16 LUFS integrated for Spotify, Apple, and most podcast platforms. YouTube targets -14 LUFS. Use a meter (FabFilter Pro-L 2, iZotope Insight, or a free option like Youlean Loudness Meter) to verify.
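The loudness step is simple arithmetic once the meter gives you a reading. The sketch below computes the gain needed to reach the target and checks whether that gain would push peaks past the limiter ceiling. The meter readings are hypothetical, and the helper names are illustrative.

```python
def gain_to_target(measured_lufs, target_lufs=-16.0):
    """Gain in dB that moves a measured integrated loudness to the target."""
    return target_lufs - measured_lufs


def db_to_linear(gain_db):
    """Convert a dB gain to the linear factor applied to sample amplitudes."""
    return 10 ** (gain_db / 20)


def needs_limiter(peak_dbfs, gain_db, ceiling_dbfs=-1.0):
    """True when the peak after gain would exceed the limiter ceiling."""
    return peak_dbfs + gain_db > ceiling_dbfs


# Hypothetical meter readings for an unmastered mix.
measured_lufs, peak_dbfs = -21.5, -6.0
gain = gain_to_target(measured_lufs)

print(f"apply {gain:+.1f} dB (x{db_to_linear(gain):.2f} amplitude)")
print("limiter will engage" if needs_limiter(peak_dbfs, gain) else "peaks are safe")
```

When the limiter does engage it changes the integrated loudness slightly, so measure again after the final export rather than trusting the arithmetic alone.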

Descript is the dominant editing tool in 2026 because of its text-based editing. You edit the transcript, and the audio edits with it. For AI podcasts, this is particularly useful: re-rolls happen at the script level, you regenerate one sentence in ElevenLabs, paste it back into Descript, and the timeline adjusts automatically.

Music, SFX, and mix decisions

Music makes AI podcasts feel finished or amateur. Bad music choice is one of the loudest amateur tells in the format. Three rules:

Use music that fits the show register, not music that is "podcast-y". The default loop on a free podcast intro library is a generic acoustic guitar plus light percussion that screams "amateur podcast". A short instrumental piece picked specifically for the show's tone reads as deliberate.

Music under voice should be ducked aggressively. AI voices are already fighting for clarity. Music that sits above -25dB under voice will fight the synthesis and produce a muddy mix.

Use restraint on SFX. Stings, transitions, and effect layers work in tightly produced narrative podcasts (Serial-style). They do not work in conversational or essay formats and they will make an AI podcast sound like a content farm if overused.

Sources for music: Soundstripe and Artlist for traditional licensed libraries. Suno and Udio for AI-generated music; both produce usable instrumental beds in 2026, though licensing terms for commercial use are evolving and worth verifying.

Distribution: Spotify, Apple, YouTube

The platforms that matter for AI podcast distribution are the same platforms that matter for all podcasts. The AI-specific considerations are layered on top.

Platform	AI policy (May 2026)	Disclosure required	Discovery mechanism	Notes
Spotify	Permitted with disclosure; impersonation prohibited	Yes, in show description	Editorial playlists, category charts, AI-driven recommendations	Largest podcast platform globally; AI labeling is platform-enforced
Apple Podcasts	No AI-specific restriction	Best practice, not mandated	Editorial features, charts, search	Tighter editorial filter than Spotify
YouTube (audio + video)	Permitted with disclosure	Yes, via "Altered content" toggle on upload	Search, recommendations, suggested videos	Treats podcasts as videos; static-image episodes underperform
Amazon Music	Permitted with disclosure	Yes	Discovery limited; smaller share	Worth uploading but rarely a primary growth channel
Pocket Casts, Overcast	No AI restriction	Follows host show notes	Direct subscribers	Power-listener platforms

For an AI podcast distribution strategy, the practical move is to publish on all major platforms via a podcast host that distributes everywhere (Transistor, Captivate, Buzzsprout, Megaphone), then invest discovery effort in the two that matter most for your category.

Spotify is the largest and the one with the most aggressive AI-content awareness. Spotify added AI-content tagging in 2024 and continues to expand it. The platform does not currently penalize AI-tagged content in discovery, but it does surface the tag prominently on the episode page. The studio's posture: lean into the disclosure rather than try to hide it. Listeners who reach an AI-tagged podcast and then enjoy it have already self-selected past the AI objection.

YouTube is the second-largest podcast platform by listening hours in 2026. The format on YouTube is video-first, which creates an interesting question for AI podcasts: do you publish as static audio with a single image, or do you build a video layer? Static audio underperforms. A simple animated waveform layer or a looping ambient image performs noticeably better. AI talking-avatar videos using HeyGen Avatar V are an emerging third option, addressed in the AI talking avatar workflow pillar.

Monetization paths for AI podcasts

AI podcasts monetize through the same channels as traditional podcasts. The unit economics are better because production cost is lower. The audience economics are worse because AI shows generally take longer to grow trust with listeners.

Advertising. Programmatic ad networks (Acast, Megaphone, Libsyn AdvertiseCast) place dynamic ads into your episodes. CPM rates for podcast ads run 18 to 30 dollars per thousand impressions in 2026 for premium inventory, 8 to 15 dollars for general inventory. AI podcasts can take ads. Some networks require disclosure of AI-generated content to advertisers.

Sponsorship. Direct sponsor deals at flat per-episode or per-month rates. Pricing ranges from 25 to 100 dollars per thousand listens for direct-response sponsors, 50 to 200 for brand sponsors. AI podcasts can sell sponsorships; the disclosure question is the same. Some brands are cautious about associating with synthetic hosts; others actively prefer the predictability.

Subscription. Apple Podcasts Subscriptions and Spotify Open Access let creators charge for premium episodes or ad-free feeds. Patreon remains the dominant third-party option. Subscription models work for AI podcasts that have either a niche-specific audience or a distinct format value (daily briefings, exclusive analysis).

Lead generation for a primary business. The dominant model for the studio's positioning. An AI podcast that reaches founders, operators, or creators becomes a top-of-funnel asset for selling Studio Build, Done-for-You services, or a future AI Podcast Logic product. The audience does not need to be large; it needs to be the right audience.

Branded podcasts for clients. Done-for-you AI podcast production for brands runs 1,500 to 5,000 dollars per episode in agency pricing. The category is small but growing. Clients are typically founders or marketing teams that want a podcast voice without the production load. Studio Build customers occasionally cross-sell into this.
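The advertising rates quoted above turn into per-episode revenue with one multiplication. The sketch below uses the CPM ranges from this section; the download count, the number of ad slots, and the fill rate are hypothetical.

```python
def ad_revenue(downloads, cpm, ad_slots=1, fill_rate=1.0):
    """Revenue in dollars for one episode.

    cpm is dollars per thousand impressions; fill_rate is the fraction
    of available ad slots the network actually sells.
    """
    return downloads / 1000 * cpm * ad_slots * fill_rate


downloads = 5000  # hypothetical per-episode downloads

general = (ad_revenue(downloads, 8), ad_revenue(downloads, 15))
premium = (ad_revenue(downloads, 18), ad_revenue(downloads, 30))

print(f"general inventory: {general[0]:.0f} to {general[1]:.0f} dollars")
print(f"premium inventory: {premium[0]:.0f} to {premium[1]:.0f} dollars")
```

At this hypothetical audience size a single ad slot roughly covers the tooling cost of an episode and little else, which is why lead generation and branded work carry more weight for small shows.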

Disclosure requirements and legal exposure

Disclosure for AI podcasts is now required on several major platforms and in a growing number of jurisdictions, and it is best practice everywhere else. The legal landscape changed materially in 2025 and 2026.

United States. FTC guidance issued in late 2024 requires disclosure of AI-generated content in advertising and endorsement contexts. The FTC has stated that synthetic voices used to deliver branded content fall within the same disclosure rules as human endorsers. For non-advertising content, federal disclosure is not yet mandated, but state-level deepfake statutes in California, Texas, Tennessee, and Florida restrict the use of synthetic voices that impersonate real people. The risk is highest if your podcast clones a public figure.

European Union. The EU AI Act's obligations for general-purpose AI models began applying in August 2025, and its transparency obligations for synthetic content are scheduled to apply from August 2026. Synthetic audio that depicts a real person must be clearly labeled. Synthetic content distributed in the EU must include machine-readable provenance metadata where technically feasible. This applies to your podcast if it is available to EU listeners through Spotify, Apple, or any other distribution channel that reaches the EU.

Platform policies. Spotify requires AI disclosure in show descriptions for fully AI-generated audio. YouTube requires AI disclosure via the "Altered content" toggle. Apple Podcasts recommends but does not yet enforce disclosure. TikTok requires the in-app AI content toggle for audio overlays.

Practical disclosure pattern. The studio's posture is to disclose proactively in three places: the show description on every platform, the first 30 seconds of every episode, and the episode show notes. The disclosure line is short: "this podcast is produced with AI-generated voice, hosted by [name]'s cloned voice model." Listeners who care about disclosure get it. Listeners who do not care experience minor friction at most.
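The three-place pattern is easy to verify before publishing. The sketch below builds the disclosure line from the wording above and reports which of the three places is missing it. `missing_disclosures` is a hypothetical helper, and it only checks that the line appears in the intro script, not that it falls within the first 30 seconds.

```python
DISCLOSURE_TEMPLATE = (
    "this podcast is produced with AI-generated voice, "
    "hosted by {name}'s cloned voice model."
)


def disclosure_line(host_name):
    return DISCLOSURE_TEMPLATE.format(name=host_name)


def missing_disclosures(host_name, show_description, intro_script, show_notes):
    """Return the places where the disclosure line does not appear."""
    line = disclosure_line(host_name).lower()
    places = {
        "show description": show_description,
        "episode intro": intro_script,
        "show notes": show_notes,
    }
    return [place for place, text in places.items() if line not in text.lower()]


line = disclosure_line("Mike Zapata")
print(missing_disclosures(
    "Mike Zapata",
    show_description="A weekly show. " + line,
    intro_script="Welcome back. " + line,
    show_notes="Links below.",
))
```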

The cost of over-disclosure is small. The cost of under-disclosure is platform removal, FTC exposure, and reputational damage that propagates faster in the AI-aware media environment than it would have five years ago.

Tool comparison: Wondercraft, Jellypod, Riverside, Descript

The four tools most often considered for AI podcast production each occupy a different position in the workflow. The comparison below addresses where each tool wins.

Tool	Best for	Voice quality	Editing depth	Pricing (May 2026)	All-in-one or layer
Wondercraft	Solo creators who want one-pane production	Good (multiple voice providers integrated, including ElevenLabs)	Moderate; built-in editor	$39 to $159 per month	All-in-one
Jellypod	Template-driven daily / weekly podcasts at high volume	Good (ElevenLabs integration)	Light; template-first	$19 to $99 per month	All-in-one
Riverside.fm + AI	Hybrid shows with real guests and AI host	High (live recording quality for guests; cloned host via integration)	Strong; full editor	$24 to $99 per month	Layer (recording + editing)
Descript	Assembled-stack workflows; text-based editing	High when paired with ElevenLabs	Best in class; text-based	$24 to $50 per month per editor	Layer (editing)
ElevenLabs (voice only)	Voice cloning across all of the above	Best in class	None	$11 to $99+ per month, scales with usage	Layer (voice)
DIY stack (ElevenLabs + Reaper + Auphonic)	Maximum control, technical operators	Best in class	Full DAW	$20 to $100 per month combined	Layer (assembled)

The decision tree:

If you want the fastest path from idea to published episode and you accept moderate editing flexibility, choose Wondercraft. The studio uses Wondercraft for client-branded podcast production where speed matters more than maximum control.

If you publish daily or weekly news briefings and want template-driven production at high cadence, choose Jellypod. The category leader for AI-generated daily podcasts.

If your show is an interview format with real human guests, choose Riverside.fm + ElevenLabs. Riverside handles the live recording infrastructure; ElevenLabs handles the AI host voice.

If you want maximum quality and you are willing to assemble the workflow, choose ElevenLabs + Descript as the core stack, optionally with Reaper or Logic for final mixing. This is the studio's default for the operator account's own audio output.
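The decision tree above reduces to a few ordered checks. The sketch below encodes it as a function. The order of precedence when more than one condition holds is an assumption, since the text does not rank the branches, and the parameter names are illustrative.

```python
def recommend_stack(real_guests=False, news_briefing=False, speed_over_control=False):
    """Map the decision tree to a tool recommendation.

    Precedence (guests, then briefing cadence, then speed) is an assumption.
    """
    if real_guests:
        return "Riverside.fm + ElevenLabs"
    if news_briefing:
        return "Jellypod"
    if speed_over_control:
        return "Wondercraft"
    return "ElevenLabs + Descript"


print(recommend_stack(real_guests=True))
print(recommend_stack(speed_over_control=True))
print(recommend_stack())
```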

"The all-in-one tools are five times faster. The assembled stack is two notches higher in quality. Pick the trade based on whether speed or polish is the constraint." - Operating principle, CinematicDirector.ai production notes

Common pitfalls that kill AI podcasts

Specific failure modes the studio has seen, in projects of our own and in client work.

Stability slider set too high. The dominant cause of monotone, lifeless delivery. Stability above 0.65 collapses prosody. Listeners drop out within 90 seconds.

Insufficient source corpus. Training Professional Voice Clone on 5 to 10 minutes of audio (the bare minimum) produces a model that works for short content but breaks down on long-form. Use the full 30 minutes. The quality difference is large.

No room reverb in post. The dry-recording-booth sound is the loudest AI tell. Add 300-500ms of light room reverb at 5-12 percent mix. This single move closes most of the perceptual gap with human-recorded podcasts.

Reading bad writing well. The voice clone is a delivery mechanism, not a script doctor. A mediocre script delivered by a perfect voice clone produces a mediocre podcast. Most AI podcast failures are script failures wearing voice clothing.

Skipping the disclosure. The risks are removal from Spotify, demonetization on YouTube, and exposure to FTC action if any sponsorship is involved. Cheap to avoid, expensive to fix.

Cross-platform identical uploads. Each platform's algorithm weights upload velocity and recency. Stagger by 24 to 48 hours. The same lesson applies as in the persona-feed cross-posting rules from the AI persona generator pillar.

Generic music underneath voice. Free podcast intro music kills the brand register. Pick music that matches the show's tone deliberately, or no music at all.

Inconsistent voice across episodes. ElevenLabs voice models persist, but slider settings do not save automatically. Save your generation presets and reuse them. Episodes that drift in vocal character feel sloppy.

Episode length mismatch with format. A 90-minute AI podcast on a non-conversational format will exhaust listener tolerance. AI narration sustains 20 to 40 minutes well; beyond that, even excellent voice clones start to wear.
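For the inconsistent-voice pitfall above, the simplest fix is a preset file kept next to the show's scripts. The sketch below saves and reloads generation settings as JSON. The file name, the field names, and the placeholder voice ID are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_preset(path, preset):
    """Write generation settings to a JSON file."""
    Path(path).write_text(json.dumps(preset, indent=2, sort_keys=True))


def load_preset(path):
    """Read generation settings back from a JSON file."""
    return json.loads(Path(path).read_text())


# Hypothetical preset; the voice ID is a placeholder, not a real identifier.
preset = {
    "voice_id": "YOUR_VOICE_ID",
    "stability": 0.5,
    "similarity": 0.8,
    "chunk_words": [500, 800],
}

with tempfile.TemporaryDirectory() as folder:
    preset_file = Path(folder) / "show_preset.json"
    save_preset(preset_file, preset)
    loaded = load_preset(preset_file)

print(loaded == preset)
```

Loading the same file at the start of every episode keeps slider values from drifting between sessions.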

ABOUT THE AUTHOR

Mike Zapata is the founder of CinematicDirector.ai, the AI-native creative studio behind Ava Moreno (@theavamoreno), built and launched in May 2026 using the same identity-consistent AI workflows documented in Studio Logic. The studio uses ElevenLabs for the operator account's own audio output and for client-commissioned AI podcast production. Mike has personally trained voice models for the studio's own use and for several brand clients, and helps founders and operators build AI-native media operations across both visual and audio channels.

About the studio · See Ava Moreno →

FREQUENTLY ASKED QUESTIONS

Q: Can AI podcasts rank on Spotify and Apple Podcasts?

A: Yes. Spotify and Apple Podcasts do not currently block AI-generated audio from rankings or search. Spotify added an explicit AI policy in 2024 that permits AI voice content as long as it does not impersonate identifiable people without consent. Apple has no AI-specific restriction. What matters for rankings is the same as it has always been: completion rate, follower growth rate, and how often listeners return for episode two. AI-narrated shows can hit category charts; several have. The catch is that listener tolerance for plastic narration is low, so quality of synthesis is the floor.

Q: Do AI podcasts need to disclose that they are AI-generated?

A: Disclosure is best practice and increasingly required. Spotify requires creators to label fully AI-generated audio in the show description. The EU AI Act, which entered into force in August 2024, requires clear labeling of synthetic audio under transparency obligations that apply from August 2026. FTC guidance in the US requires disclosure if a brand sponsors the show and the host is AI. Practical rule: put a line in the show description and in every episode's first 30 seconds noting that the voice is AI-generated. The honest framing tends to outperform the cloak-and-dagger one.

Q: How much does it cost to make an AI podcast versus a traditional podcast?

A: An AI podcast can run between 25 and 200 dollars per episode in tool subscriptions, depending on length and how many voices you use. A traditional podcast with a freelance producer, host fee, editor, and studio time typically runs 800 to 3,000 dollars per episode for a 45-minute interview show. AI removes the production line. It does not remove the cost of a good script, a real interview booking, or sound design. The savings are real but they shift where you spend, not whether you spend.
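The arithmetic behind that comparison, as a quick sketch. The per-episode ranges come from the answer above. The 12-episode season and the function names are assumptions for illustration, and the figures cover tooling and production only, not scripting, booking, or sound design.

```python
AI_COST_PER_EPISODE = (25, 200)              # tool subscriptions, USD
TRADITIONAL_COST_PER_EPISODE = (800, 3000)   # producer, host, editor, studio, USD


def season_cost(per_episode: tuple[int, int], episodes: int) -> tuple[int, int]:
    """Low and high total cost for a season of the given length."""
    low, high = per_episode
    return low * episodes, high * episodes


def savings_range(episodes: int) -> tuple[int, int]:
    """Smallest and largest plausible saving from switching to an AI workflow."""
    ai_low, ai_high = season_cost(AI_COST_PER_EPISODE, episodes)
    trad_low, trad_high = season_cost(TRADITIONAL_COST_PER_EPISODE, episodes)
    # Worst case: cheapest traditional show versus most expensive AI setup.
    return trad_low - ai_high, trad_high - ai_low


print(season_cost(AI_COST_PER_EPISODE, 12))           # (300, 2400)
print(season_cost(TRADITIONAL_COST_PER_EPISODE, 12))  # (9600, 36000)
print(savings_range(12))                              # (7200, 35700)
```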

Q: What is the best AI voice cloning tool for podcast hosts?

A: ElevenLabs Professional Voice Clone is the current category leader for long-form spoken content. It handles prosody, breath, and emotional inflection more naturally than Resemble, Play.ht, or open-source alternatives like Coqui. The trade-off is cost: ElevenLabs scales by character count, and a 30-minute episode can consume meaningful credits on a Creator plan. For multi-host shows, voice consistency across episodes is the bigger problem than first-episode quality, and ElevenLabs handles that well because trained voice models persist.
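Because billing scales with characters, it helps to estimate a script's size before generating. The sketch below is a rough estimate only. The speaking rate of 150 words per minute, the average of 6 characters per word including spaces, and the monthly quota figure are all assumptions, so check your own plan and your own scripts for the real numbers.

```python
def estimate_characters(minutes: float, words_per_minute: int = 150,
                        chars_per_word: int = 6) -> int:
    """Rough character count for a script of the given spoken length."""
    return round(minutes * words_per_minute * chars_per_word)


def episodes_per_quota(monthly_characters: int, minutes: float) -> int:
    """How many full episodes of this length fit in a monthly character quota."""
    return monthly_characters // estimate_characters(minutes)


HYPOTHETICAL_MONTHLY_QUOTA = 100_000  # placeholder, not a quoted plan limit

print(estimate_characters(30))                             # 27000
print(episodes_per_quota(HYPOTHETICAL_MONTHLY_QUOTA, 30))  # 3
```

Regenerating flubbed passages also consumes characters, so treat the result as an upper bound on episodes per month.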

Q: Can I use a cloned voice that is not mine?

A: Only with documented consent from the person whose voice is being cloned. ElevenLabs requires a recorded consent statement before training a Professional Voice Clone of someone other than the account holder. Cloning a public figure without consent violates the platform terms and likely violates state-level deepfake statutes in California, Texas, Tennessee, and others. The legal exposure is real and increasing. For show hosts, clone yourself or use one of the platform's stock voices.

Q: How do I make AI podcast voices sound natural and not robotic?

A: Three things: source recording quality, stability slider tuning, and post-processing. Source matters most. Thirty minutes of varied, emotionally textured speech recorded in a treated room produces a noticeably better clone than 30 minutes of monotone scripted reading. Stability around 0.40 to 0.55 leaves room for emotional variation. Post-processing matters more than people admit. Apply gentle compression, a touch of room reverb to break the recording-booth dryness, and remove any breath sounds that feel mechanical. The robotic feeling usually lives in the polish, not the model.

Q: What is the best AI podcast generator for solo creators?

A: Wondercraft is the strongest all-in-one for solo creators who want a single pane of glass: script in, episode out, distribution included. Jellypod is the alternative if you want template-driven daily podcasts at higher volume. For creators who want maximum control and are willing to assemble the workflow themselves, the stack is ElevenLabs for voice plus Descript for editing plus a hosting platform like Transistor or Captivate. The all-in-one is faster. The assembled stack produces better-sounding output because each tool is best-in-class for its layer.

RELATED GUIDES

→ AI talking avatar workflow
→ AI persona generator workflows
→ Best AI influencer generator tools
→ AI UGC creator workflow
→ How to make an AI influencer

Want to go deeper? Read the complete guide: AI persona generator workflows →

WORK WITH THE STUDIO

STUDIO BUILD

The full studio workflow library, plus 90 days of new workflows for identity-consistent AI media, covers the persona side of the operation. Includes voice and audio modules as they ship. [Get Studio Build access →]

AI PODCAST LOGIC, COMING SOON

The dedicated AI podcast production system: voice cloning workflows, episode templates, sound design presets, and the studio's reference scripts. Launching late 2026.

SOURCES

ElevenLabs. "Professional Voice Cloning documentation." ElevenLabs Help Center, accessed May 2026. https://help.elevenlabs.io/
Spotify. "Spotify AI Content Policy." Spotify for Creators, updated 2024-2025. https://creators.spotify.com/
European Parliament. "Regulation (EU) 2024/1689 (Artificial Intelligence Act)." Official Journal of the European Union, June 2024. https://eur-lex.europa.eu/
Federal Trade Commission. "Endorsement Guides and AI-Generated Content." FTC, 2024-2025. https://www.ftc.gov/business-guidance/advertising-marketing/endorsements-testimonials
NotebookLM team. "Audio Overviews launch documentation." Google NotebookLM, 2024. https://notebooklm.google/
Wondercraft. "Pricing and features." Wondercraft, accessed May 2026. https://www.wondercraft.ai/
Jellypod. "Product documentation." Jellypod, accessed May 2026. https://jellypod.ai/
Riverside.fm. "AI features and integrations." Riverside, accessed May 2026. https://riverside.fm/
Descript. "Podcast editing documentation." Descript, accessed May 2026. https://www.descript.com/

Mike Zapata

Founder · CinematicDirector.ai

Mike Zapata is the founder of CinematicDirector.ai, the studio behind @theavamoreno, which was built and launched in May 2026 using the same identity-consistent AI workflows documented in Studio Logic. He also operates ListingDirector.ai and Mike Zapata Real Estate.

See Ava's work → · About the studio

The Proof Artifact

Built with this system. Posting daily.

@theavamoreno is the studio's first AI persona. Face-consistent, voice-cloned, posting every day. Every reel uses the exact workflow documented above. She is the live demo.

Follow @theavamoreno
