Meta AI Releases SAM Audio: A State-of-the-Art Unified Model that Uses Intuitive and Multimodal Prompts for Audio Separation

Meta has released SAM Audio, a prompt-driven audio separation model that targets a common editing bottleneck: isolating one sound from a real-world mix without building a custom model per sound class. It comes in three main sizes: sam-audio-small, sam-audio-base, and sam-audio-large. The model is available to download and can be tried in the Segment Anything Playground.

Architecture

SAM Audio uses separate encoders for each conditioning signal—an audio encoder for the mixture, a text encoder for the natural language description, a span encoder for time anchors, and a visual encoder that consumes a visual prompt derived from video plus an object mask. The encoded streams are concatenated into time-aligned features, which are then processed by a diffusion transformer that applies self-attention over the time-aligned representation and cross-attention to the textual feature. A DACVAE decoder reconstructs waveforms and emits two outputs: target audio and residual audio.
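The data flow described above can be illustrated with a toy sketch in plain Python. This is not Meta's implementation: the feature sizes, the channel-wise concatenation, the single attention head without learned projections or residual connections, and every function name here are assumptions made only to show the order of operations (concatenate time-aligned streams, self-attend over time, then cross-attend to the text features).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention with no learned projections (toy version)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def concat_time_aligned(*streams):
    """Join per-frame features from several encoders along the channel axis."""
    n_frames = len(streams[0])
    assert all(len(s) == n_frames for s in streams), "streams must share a time axis"
    return [sum((s[t] for s in streams), []) for t in range(n_frames)]

# Toy inputs: 3 time frames; every number is made up for illustration.
audio_feats = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]   # mixture encoder output
span_feats = [[0.0], [1.0], [1.0]]                   # 1.0 where a time anchor is active
visual_feats = [[0.2], [0.4], [0.1]]                 # masked-video encoder output
text_tokens = [[0.5, 0.1, 0.0, 0.3], [0.2, 0.2, 0.9, 0.1]]  # text encoder output

frames = concat_time_aligned(audio_feats, span_feats, visual_feats)
frames = attend(frames, frames, frames)            # self-attention over time
frames = attend(frames, text_tokens, text_tokens)  # cross-attention to text
print(len(frames), len(frames[0]))
```

In the real model the transformer output would then be passed to the DACVAE decoder to produce the target and residual waveforms; that step is left out here.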

What SAM Audio does, and what ‘segment’ means here

SAM Audio takes an input recording that contains multiple overlapping sources, like speech plus traffic plus music, and separates a target source based on a prompt. In the public inference API, the model produces two outputs: result.target (the isolated sound) and result.residual (everything else).

This target-residual interface maps directly to editor operations. For instance, to remove a dog bark from a podcast track, treat the bark as the target and keep only the residual. Conversely, if you want to extract a guitar part from a concert clip, you keep the target waveform instead. Meta uses these examples to illustrate the model’s potential.
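The mapping from the two outputs to editor operations can be sketched as follows. The SeparationResult class is a hypothetical stand-in for whatever object the inference API returns (only the target and residual attribute names come from the text above), and the remix helper additionally assumes the two waveforms are sample-aligned and of equal length.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SeparationResult:
    """Hypothetical stand-in for the object returned by the inference API."""
    target: List[float]    # the isolated sound
    residual: List[float]  # everything else in the mixture

def remove_sound(result: SeparationResult) -> List[float]:
    """Remove the prompted sound (e.g. a dog bark): keep only the residual."""
    return result.residual

def extract_sound(result: SeparationResult) -> List[float]:
    """Extract the prompted sound (e.g. a guitar part): keep only the target."""
    return result.target

def remix(result: SeparationResult, target_gain: float) -> List[float]:
    """Turn the prompted sound down or up instead of deleting it outright.

    Assumes target and residual are sample-aligned (an assumption of this sketch).
    """
    return [r + target_gain * t for t, r in zip(result.target, result.residual)]

# Made-up sample values, just to exercise the helpers.
demo = SeparationResult(target=[0.5, -0.5], residual=[0.25, 0.25])
podcast_without_bark = remove_sound(demo)
guitar_only = extract_sound(demo)
bark_at_half_volume = remix(demo, target_gain=0.5)
```

With a gain of 0 the remix reduces to the residual, so removal is just the limiting case of turning the target all the way down.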

The 3 prompt types Meta is shipping

Meta positions SAM Audio as a single unified model supporting three prompt types, usable alone or in combination:

Text prompting: Describe the sound in natural language, e.g., “dog barking” or “singing voice,” and the model separates that sound from the mixture. Text prompts are a core interaction mode, with an end-to-end example available in the open-source repo using SAMAudioProcessor and model.separate.

Visual prompting: Click on a person or object in a video to ask the model to isolate the audio linked to that visual object, implemented by passing video frames and masks into the processor via masked_videos.

Span prompting: Mark time segments where the target sound occurs; the model uses those spans to guide separation. This is crucial for ambiguous cases, such as when the same instrument appears multiple times or when a sound is brief, helping to prevent over-separation.
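As a rough sketch of how the three prompt types might be combined in one request, the helper below assembles keyword arguments for a SAMAudioProcessor-style call and then runs model.separate. Only SAMAudioProcessor, model.separate, masked_videos, result.target, and result.residual are named in the article; the audios, descriptions, and anchors keyword names, the (start, end) span format in seconds, and both helper functions are assumptions and may not match the real API. The model and processor are passed in as arguments so the sketch does not depend on how they are loaded.

```python
def build_processor_inputs(audio_path, description="", spans=None, masked_videos=None):
    """Assemble keyword arguments for a SAMAudioProcessor-style call.

    Only `masked_videos` is named in the article; `audios`, `descriptions`,
    and `anchors` are assumed keyword names and may differ in the real API.
    """
    kwargs = {"audios": [audio_path], "descriptions": [description]}
    if spans is not None:
        for start, end in spans:
            if not 0 <= start < end:
                raise ValueError(f"invalid span: ({start}, {end})")
        # One list of (start, end) spans per audio item; format is an assumption.
        kwargs["anchors"] = [[(start, end) for start, end in spans]]
    if masked_videos is not None:
        # Video frames plus object masks for visual prompting.
        kwargs["masked_videos"] = masked_videos
    return kwargs

def separate(model, processor, **prompt):
    """Run one separation request and return (target, residual)."""
    batch = processor(**build_processor_inputs(**prompt))
    result = model.separate(batch)
    return result.target, result.residual

# Text prompt plus a time span, with hypothetical file name and timestamps.
inputs = build_processor_inputs("mix.wav", description="dog barking", spans=[(6.3, 7.0)])
print(sorted(inputs))
```

Given a loaded model and processor, a call such as separate(model, processor, audio_path="mix.wav", description="singing voice") would return the two waveforms; device placement and saving the audio are left out of the sketch.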

Results

The Meta team claims SAM Audio achieves cutting-edge performance across diverse, real-world scenarios and serves as a unified alternative to single-purpose audio tools. They published a subjective evaluation across seven categories: General, SFX, Speech, Speaker, Music, Instr(wild), and Instr(pro). The reported General scores are 3.62 for sam-audio-small, 3.28 for sam-audio-base, and 3.50 for sam-audio-large, and sam-audio-large reaches 4.49 on Instr(pro).

Key Takeaways

SAM Audio is a unified audio separation model that segments sound from complex mixtures using text prompts, visual prompts, and time span prompts.

The core API produces two waveforms per request: target for the isolated sound and residual for everything else, easily mapping to common edit operations like removing noise, extracting stems, or keeping ambience.

Meta released multiple checkpoints and variants, including sam-audio-small, sam-audio-base, sam-audio-large, plus TV variants that perform better for visual prompting. The repo also includes a subjective evaluation table by category.

The release includes tooling beyond inference: Meta provides a sam-audio-judge model that scores separation results against a text description, evaluating overall quality, recall, precision, and faithfulness.
