Meta has released SAM Audio, a prompt-driven audio separation model that targets a common editing bottleneck: isolating one sound from a real-world mix without building a custom model per sound class. The release includes three main sizes: sam-audio-small, sam-audio-base, and sam-audio-large. The model is available for download and for experimentation in the Segment Anything Playground.
Architecture
SAM Audio uses a separate encoder for each conditioning signal: an audio encoder for the mixture, a text encoder for the natural-language description, a span encoder for time anchors, and a visual encoder that consumes a visual prompt derived from video plus an object mask. The encoded streams are concatenated into time-aligned features, which are then processed by a diffusion transformer that applies self-attention over the time-aligned representation and cross-attention to the text features. A DACVAE decoder reconstructs waveforms and emits two outputs: target audio and residual audio.
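To make the data flow concrete, here is a toy NumPy sketch of the conditioning path described above. It is not Meta's implementation. The shapes, the channel-wise concatenation, the random projection, and the single-head attention without learned query/key/value weights are all assumptions chosen for illustration, and the real encoders and diffusion process are replaced by random features.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 50, 16        # hypothetical number of time frames and feature width
N_TEXT = 7           # hypothetical number of text tokens

# Stand-ins for encoder outputs (random here; the real encoders are learned).
audio_feat = rng.normal(size=(T, D))       # mixture audio, one vector per frame
span_feat = rng.normal(size=(T, D))        # time anchors, one vector per frame
visual_feat = rng.normal(size=(T, D))      # visual prompt, assumed resampled to T frames
text_feat = rng.normal(size=(N_TEXT, D))   # text tokens, not time-aligned

# Assumption: the time-aligned streams are concatenated along the channel axis,
# then projected back to width D.
aligned = np.concatenate([audio_feat, span_feat, visual_feat], axis=-1)  # (T, 3*D)
w_in = rng.normal(size=(3 * D, D)) / np.sqrt(3 * D)
x = aligned @ w_in                                                       # (T, D)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention, single head, no learned projections.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# One simplified transformer step with residual connections.
x = x + attention(x, x, x)                  # self-attention over time frames
x = x + attention(x, text_feat, text_feat)  # cross-attention to text tokens

print(aligned.shape, x.shape)
```

The point of the sketch is the asymmetry: audio, span, and visual features share the time axis and can be merged frame by frame, while text has no time axis and is therefore attended to rather than concatenated.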
What SAM Audio does, and what ‘segment’ means here
SAM Audio takes an input recording that contains multiple overlapping sources, like speech plus traffic plus music, and separates a target source based on a prompt. In the public inference API, the model produces two outputs: result.target (the isolated sound) and result.residual (everything else).
This target-residual interface maps directly to editor operations. For instance, to remove a dog bark from a podcast track, treat the bark as the target and keep only the residual. Conversely, if you want to extract a guitar part from a concert clip, you keep the target waveform instead. Meta uses these examples to illustrate the model’s potential.
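The remove and extract operations can be sketched in a few lines. The example below does not call the real SAM Audio API: `SeparationResult` is a hypothetical stand-in that only mirrors the `target` and `residual` fields named above, `apply_edit` is an illustrative helper, and the "bark" and "speech" signals are synthetic sine waves.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SeparationResult:
    """Hypothetical stand-in for a separation output with two waveforms."""
    target: np.ndarray    # the isolated sound described by the prompt
    residual: np.ndarray  # everything else in the mix


def apply_edit(result: SeparationResult, mode: str) -> np.ndarray:
    """Map an editor operation onto the target/residual pair."""
    if mode == "remove":    # drop the prompted sound, keep the rest
        return result.residual
    if mode == "extract":   # keep only the prompted sound
        return result.target
    raise ValueError(f"unknown mode: {mode!r}")


# Synthetic one-second signals standing in for real separation output.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
bark = 0.5 * np.sin(2 * np.pi * 880.0 * t)    # pretend this is the dog bark
speech = 0.3 * np.sin(2 * np.pi * 220.0 * t)  # pretend this is the podcast voice

result = SeparationResult(target=bark, residual=speech)

cleaned_podcast = apply_edit(result, "remove")   # podcast without the bark
isolated_bark = apply_edit(result, "extract")    # the bark on its own
```

With a real model, the same selection logic would apply to whatever object the inference call returns, provided it exposes the two waveforms described above.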
The 3 prompt types Meta is shipping
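Meta positions SAM Audio as a single unified model supporting three prompt types, usable alone or in combination: text prompts (a natural-language description of the sound), visual prompts (derived from video plus an object mask), and span prompts (time anchors marking where the sound occurs).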
Results
The Meta team claims SAM Audio achieves cutting-edge performance across diverse, real-world scenarios and serves as a unified alternative to single-purpose audio tools. They published a subjective evaluation across seven categories (General, SFX, Speech, Speaker, Music, Instr(wild), and Instr(pro)). In the General category the reported scores are 3.62 for sam-audio-small, 3.28 for sam-audio-base, and 3.50 for sam-audio-large, while the Instr(pro) score for sam-audio-large reaches 4.49.







