    Meta AI Releases SAM Audio: A State-of-the-Art Unified Model that Uses Intuitive and Multimodal Prompts for Audio Separation
    AI News

December 17, 2025 · 3 Mins Read

Meta has released SAM Audio, a prompt-driven audio separation model that targets a common editing bottleneck: isolating one sound from a real-world mix without building a custom model per sound class. Meta released three main sizes: sam-audio-small, sam-audio-base, and sam-audio-large. The model is available to download and to experiment with in the Segment Anything Playground.

    Architecture

    SAM Audio uses separate encoders for each conditioning signal—an audio encoder for the mixture, a text encoder for the natural language description, a span encoder for time anchors, and a visual encoder that consumes a visual prompt derived from video plus an object mask. The encoded streams are concatenated into time-aligned features, which are then processed by a diffusion transformer that applies self-attention over the time-aligned representation and cross-attention to the textual feature. A DACVAE decoder reconstructs waveforms and emits two outputs: target audio and residual audio.
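At a shape level, this fusion step can be sketched as follows. The frame count, feature dimensions, and encoder stand-ins below are illustrative placeholders, not Meta's actual implementation:

```python
import numpy as np

T, D = 100, 8  # illustrative: 100 time frames, 8 feature dims per stream

# Stand-ins for encoder outputs, each aligned to the same time axis
audio_feat = np.random.randn(T, D)   # mixture audio encoder
span_feat = np.random.randn(T, D)    # span (time-anchor) encoder
visual_feat = np.random.randn(T, D)  # visual prompt encoder (video + mask)
text_feat = np.random.randn(12, D)   # text encoder output, one row per token

# The time-aligned streams are concatenated along the feature axis; per the
# description above, the diffusion transformer self-attends over this fused
# sequence and cross-attends to the text features separately
fused = np.concatenate([audio_feat, span_feat, visual_feat], axis=-1)
print(fused.shape)  # (100, 24)
```

The key design point is that conditioning signals share one time axis, so the transformer can relate a visual event or a marked span directly to the audio frames it overlaps.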

What SAM Audio does, and what ‘segment’ means here

    SAM Audio takes an input recording that contains multiple overlapping sources, like speech plus traffic plus music, and separates a target source based on a prompt. In the public inference API, the model produces two outputs: result.target (the isolated sound) and result.residual (everything else).

    This target-residual interface maps directly to editor operations. For instance, to remove a dog bark from a podcast track, treat the bark as the target and keep only the residual. Conversely, if you want to extract a guitar part from a concert clip, you keep the target waveform instead. Meta uses these examples to illustrate the model’s potential.
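The additive relationship behind these edit operations can be shown with synthetic signals. This is a toy illustration of the target/residual contract, not the model itself; the sine waves stand in for real sources:

```python
import numpy as np

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)

bark = 0.5 * np.sin(2 * np.pi * 440 * t)    # stand-in for a dog bark
speech = 0.3 * np.sin(2 * np.pi * 220 * t)  # stand-in for podcast speech
mixture = bark + speech

# An ideal separator prompted with "dog barking" would return:
target = bark        # result.target: the isolated sound
residual = speech    # result.residual: everything else

# "Remove the bark from the podcast": keep only the residual
cleaned = residual
# "Extract the stem from the mix": keep only the target
stem = target

# Target and residual are complementary: together they rebuild the mix
assert np.allclose(target + residual, mixture)
```

Because the two outputs sum back to the input, an editor only ever has to choose which waveform to keep; no masking or re-mixing logic is needed on top.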

The three prompt types Meta is shipping

    Meta positions SAM Audio as a single unified model supporting three prompt types, usable alone or in combination:

  • Text prompting: Describe the sound in natural language, e.g., “dog barking” or “singing voice,” and the model separates that sound from the mixture. Text prompts are a core interaction mode, with an end-to-end example available in the open-source repo using SAMAudioProcessor and model.separate.
  • Visual prompting: Click on a person or object in a video to ask the model to isolate the audio linked to that visual object, implemented by passing video frames and masks into the processor via masked_videos.
  • Span prompting: Mark time segments where the target sound occurs; the model uses those spans to guide separation. This is crucial for ambiguous cases, such as when the same instrument appears multiple times or when a sound is brief, helping to prevent over-separation.
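One plausible way to represent span prompts is as a frame-level mask derived from (start, end) pairs in seconds. The function below is a hypothetical sketch; the frame rate and the actual span encoding SAM Audio uses may differ:

```python
import numpy as np

def spans_to_mask(spans, duration_s, frame_rate=50):
    """Convert (start, end) second pairs into a frame-level 0/1 mask.

    Illustrative only: SAM Audio's real span encoder may represent
    time anchors differently.
    """
    n_frames = int(duration_s * frame_rate)
    mask = np.zeros(n_frames, dtype=np.float32)
    for start, end in spans:
        lo = int(start * frame_rate)
        hi = min(int(end * frame_rate), n_frames)
        mask[lo:hi] = 1.0  # flag frames where the target sound occurs
    return mask

# Mark two moments where the target sound occurs in a 10 s clip
mask = spans_to_mask([(1.0, 2.5), (6.0, 7.0)], duration_s=10.0)
print(int(mask.sum()))  # 125 frames flagged: (1.5 + 1.0) s at 50 frames/s
```

A mask like this gives the model an explicit time anchor, which is what makes span prompting useful when the same sound class recurs and text alone would be ambiguous.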

Results

The Meta team claims SAM Audio achieves state-of-the-art performance across diverse, real-world scenarios and serves as a unified alternative to single-purpose audio tools. They published a subjective evaluation across categories—General, SFX, Speech, Speaker, Music, Instr(wild), Instr(pro)—with General scores of 3.62 for sam-audio-small, 3.28 for sam-audio-base, and 3.50 for sam-audio-large, while Instr(pro) scores reached 4.49 for sam-audio-large.

    Key Takeaways

  • SAM Audio is a unified audio separation model that segments sound from complex mixtures using text prompts, visual prompts, and time span prompts.
  • The core API produces two waveforms per request: target for the isolated sound and residual for everything else, easily mapping to common edit operations like removing noise, extracting stems, or keeping ambience.
  • Meta released multiple checkpoints and variants, including sam-audio-small, sam-audio-base, sam-audio-large, plus TV variants that perform better for visual prompting. The repo also includes a subjective evaluation table by category.
  • The release includes tooling beyond inference: Meta provides a sam-audio-judge model that scores separation results against a text description, evaluating overall quality, recall, precision, and faithfulness.