Audio Restoration Lab

Microcosm — fixing the muffled audiobook

George Gilder, Microcosm (1989 cassette source, 14h 28m). Same 30-second test passage every time, loudness-matched so you're judging the sound, not the volume.

Winner found — Adobe Podcast (Enhance Speech v2)
After four misses, the Adobe pass sounds genuinely good to you. A/B it against the raw tape below. Next step is batching the full 14.5-hour book through the same engine into a clean .m4b with chapters + cover preserved.

The test passage

"…human imagination and moral revival. The exemplary technology of this era is the microchip. The computer inscribed on a tiny piece of processed material. More than any other invention this device epitomizes the overthrow of matter. Consider a parable of the microchip once told by Gordon Moore, chairman of Intel…"

A/B — the fix vs the raw tape

✓ ADOBE v2 Enhance Speech — the winner "sounds really good"
Adobe's neural restorer rebuilt the full top end naturally — clear consonants, real "air", no underwater smear. His voice, just un-muffled.
BEFORE Original raw tape boxy / muffled
The untouched 1989 source for reference.

★★ START HERE — two new roads, built tonight

You said "do better," so I stopped tweaking EQ and went after the actual technology. Two findings. (1) The honest one: the true generative restorers that work in his real voice are either clean-but-too-timid or bright-but-sparkly — on this particular tape they hit a ceiling short of Adobe. The cleanest of them is the first card below. (2) The road with real headroom is the clean re-voice — your "doesn't have to be his voice" idea — now with a proper documentary baritone instead of the earlier voice. His exact delivery and timing, zero muffle. That's the second card. Listen to those two against Adobe.
★ NEW · RE-VOICE Documentary baritone — his delivery, clean voice the headroom road
His 1989 recording drives every syllable of the timing and announcer cadence; a clean, deep documentary-narrator voice carries the sound. No muffle at all — the engine rebuilds the whole voice, so nothing boxy survives. Timing is preserved to the frame. This is the strongest "vibe, not his exact voice" version yet. (Caps at 11 kHz — slightly less top-end "air" than Adobe; I can extend that if you like the voice.)
doc baritone spectrogram
★ NEW · REAL VOICE Generative restorer (MossFormer2-SR) — his actual voice cleanest, but gentle
A true Adobe-class generative model (MossFormer2 super-resolution) run on his real voice — it throws the degraded signal away and resynthesizes, no EQ tricks, no graft seam. It's the cleanest local restore with zero artifacts, but on this tape it's conservative: it doesn't brighten as hard as Adobe. Honest A/B — does "clean but gentle" beat the brighter-but-busier earlier attempts for you?

★ NEW — artifact-cancelling ensemble (latest)

Newest result, built tonight. Each restoration engine (AudioSR, AP-BWE, Resemble) adds its own signature artifact — shimmer, buzz, metallic. But they don't agree on where. So I ran all three, lined them up, and took the median of the three at every frequency — the odd-one-out artifact gets voted out while the real voice (which all three agree on) survives. Then real low voice grafted back in and tone-matched to Adobe. It measures dead-on Adobe's curve (−52.5 dB up top, vs −52.5 Adobe). Listen for fewer artifacts than any single engine.
★ ENSEMBLE 3-engine median + real-voice graft, tone-matched artifacts voted out
AudioSR + AP-BWE + Resemble, median-combined so no single engine's artifact survives, his real voice under 4.5 kHz, EQ-shaped to Adobe's measured tone. Fully offline. A/B vs the green Adobe card and vs "Best local" below.
ensemble spectrogram
Still on the bench, both bigger lifts: FlashSR (one-step super-resolution — model download) and RVC with a trained narrator voice (the cleaner, deterministic Track-2 upgrade — needs you to pick a target voice). Say the word and I'll build samples of those too.

★ Closest yet — local, tone-matched to Adobe

The AudioSR graft was clean but ~15 dB too bright. So I measured Adobe's actual tone curve and reshaped the local version to match it — borrowing Adobe's tonal balance without using Adobe's audio. These are 100% local (his real voice + corrected AudioSR highs, EQ'd to Adobe's profile). Start your morning here and A/B against the green Adobe card up top.
★ BEST LOCAL Real voice + AudioSR air, tone-matched to Adobe the one to beat
His untouched voice under 4.5 kHz, corrected AudioSR highs on top, then EQ-shaped to Adobe's measured tone. Fully offline. Does it finally sit next to Adobe?
ALT Full AudioSR v2, tone-matched to Adobe no grafting
Same tone-match, but the whole signal is AudioSR (no real-voice graft). Compare with the graft above — which sounds more like him?
The voice-conversion track (your "doesn't have to be his voice" idea) is built — see Track 2 just below. The artifact-cancelling ensemble is now up top. FlashSR and RVC-with-a-trained-voice are the two remaining levers if these don't clear the bar.

★ Track 2 — keep his vibe, clean new voice (voice conversion)

Your "doesn't have to be his voice — the delivery is what matters" idea, realized. This is voice conversion: his original recording drives every bit of the timing, intonation and announcer cadence, but a clean professional voice provides the timbre. The muffle is gone (the engine resynthesizes the full band from scratch), and the performance is his — the pitch contour and pacing are preserved exactly (same 30.0s length). This is fundamentally different from the failed Option C: corrected engine, corrected settings.
★ VOICE CONVERSION His delivery → clean pro voice (seed-vc V2) his cadence, new timbre
His exact performance, spoken in a clean studio voice. Listen for whether the announcer energy and pacing survive — that's the whole point. The timbre is a different person; the vibe should be him.
If this lands, the upgrade path is RVC with a trained narrator voice — even cleaner, fully deterministic for the 14h book, and you pick the target voice.

Local contenders — offline, no Adobe account

LOCAL · 1 Resemble-Enhance denoise + generative restore
Strongest general local restorer. Rebuilds natural detail; far less "underwater" than VoiceFixer. Keeps his exact voice. Runs fully offline on your M4.
A/B this against the Adobe winner above — does it hold up?
LOCAL · 2 AudioSR — diffusion super-resolution (speech model) strong top-end rebuild
Diffusion model rebuilds the highs iteratively. Bright and detailed; listen for any faint diffusion "shimmer" on sibilants. Local.
LOCAL · 3 AP-BWE — purpose-built bandwidth extension (12k→48k) gentlest / most conservative
GAN trained to rebuild a band-limited signal to 48 kHz. The most restrained of the three — keeps your real content, adds a modest, clean top end. Local, and very fast (~1.4s for 30s).
How they compare on paper (energy rebuilt above 6 kHz, after matching loudness): Resemble −43 dB (most aggressive) · AudioSR −47 · Adobe −52 (natural) · AP-BWE −55 (gentlest) · raw original −58. More lift isn't automatically better — too much can sound hissy/brittle. Trust your ears; this is just the map.

AudioSR, done right — the research payoff

The researcher confirmed my earlier AudioSR shimmer was a usage bug, not the model: I'd fed it 30-second clips when it wants ~5-second windows, pre-lowpassed. Rebuilt it the right way — eight overlapping 5s windows, high diffusion quality, crossfaded. This is the same engine, used correctly.
FIXED AudioSR v2 — windowed correctly fuller, no shimmer
Same model as the artifacty one, run the way it was designed. Came out brighter than Adobe — judge whether that reads as detailed or a touch hot.
FIXED + GRAFT Real voice + AudioSR v2 air our best shot yet
The two ideas combined: his real, untouched voice below 4.5 kHz, plus the corrected AudioSR's clean highs on top. This is the most complete "our own" attempt — A/B it hard against Adobe.

Making it work on our own — the grafting experiment

The theory: the local engines added artifacts because we let them reprocess his whole voice — including the 0–4.5 kHz body that was already fine. So instead: keep his real, untouched audio below 4.5 kHz and graft on only the synthetic "air" above it. His actual timbre is never touched; the engine only fills the missing top. If the theory holds, these should sound like him — cleaner, with the metallic/shimmer junk gone from his voice.
GRAFT Real voice + AP-BWE air most conservative top
His real audio under 4.5 kHz, AP-BWE's rebuilt highs above. The gentlest of the grafts.
GRAFT Real voice + AudioSR air more presence
His real audio under 4.5 kHz, AudioSR's diffusion highs above. Brighter — listen for shimmer.
GRAFT Real voice + Resemble air brightest top
His real audio under 4.5 kHz, Resemble's restored highs above. The most aggressive top end.
These are the first attempt at the "our own" path. If one is close, the next moves are: run the source engines in their ideal short windows (kills more artifacts), bring in stronger models a researcher is scouting right now (Apollo, SGMSE+, open Miipher), and ensemble to cancel what's left. Tell me which graft is closest to the Adobe sound and I'll push that direction.

Your idea — resynthesis (know the words + the delivery, rebuild clean)

NEW · RESYNTHESIS IndexTTS-2 — his cloned voice re-reads it, clean fully synthetic
This is the "fill in the blanks with what we know he's saying" idea. We cloned his voice from the Adobe-cleaned clip, fed it the transcript, and steered the tone with his original recording. The output is brand-new, perfectly clean audio — but it's a clone re-performing, not his real waveform. Listen hard: does it sound like him, or like a very good impostor?
The honest tradeoff: this gives perfect fidelity but loses his real voice, and any transcription slip becomes a wrong word. Over 14.5 h that's the risk. Bandwidth-extension (Adobe / the locals above) keeps his actual voice, just clearer.

You can see the fix too

original spectrogram
Before — energy dies above ~5 kHz (dark top band).
adobe spectrogram
After Adobe v2 — top end rebuilt with natural detail.

Next — the full book

Doing all 14.5 h by hand on the website is painful (free tier caps daily minutes; long-file upload is paid). The clean path is Adobe's Enhance API — same engine, batched:
  1. Split the .m4b at its existing chapter marks into ~30–60 min WAV chunks.
  2. Run each chunk through the Enhance API (the exact v2 you just approved).
  3. Loudness-normalize to the −18 LUFS audiobook standard and reassemble.
  4. Re-wrap into a new .m4b with chapters + cover art preserved.
I'll confirm the API access + cost before running anything — you approve the plan first.

Already tried — you rejected these

Expand to re-hear the four that didn't land
A · NODSP remaster (EQ)"same recording, brighter"
B · NOVoiceFixer AI restore"underwater / smeared"
C · NOVoice conversion"AI-processed"
D · NOFull TTS re-narration"lost the human"