    Meta AI Releases SAM Audio: A State-of-the-Art Unified Model that Uses Intuitive and Multimodal Prompts for Audio Separation
December 17, 2025 · 3 min read

Meta has released SAM Audio, a prompt-driven audio separation model aimed at a common editing bottleneck: isolating one sound from a real-world mix without building a custom model for each sound class. Meta released three checkpoint sizes, sam-audio-small, sam-audio-base, and sam-audio-large, and the model is available to download and experiment with in the Segment Anything Playground.

    Architecture

    SAM Audio uses separate encoders for each conditioning signal—an audio encoder for the mixture, a text encoder for the natural language description, a span encoder for time anchors, and a visual encoder that consumes a visual prompt derived from video plus an object mask. The encoded streams are concatenated into time-aligned features, which are then processed by a diffusion transformer that applies self-attention over the time-aligned representation and cross-attention to the textual feature. A DACVAE decoder reconstructs waveforms and emits two outputs: target audio and residual audio.
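The dataflow described above can be sketched in miniature. This is an illustrative toy, not the real implementation: the encoders are stand-in random projections, and the dimensions `T` and `D` are arbitrary. Only the structure, one encoder per conditioning signal, time-aligned concatenation along the feature axis, mirrors the article; the real model would then run a diffusion transformer and DACVAE decoder on the fused features.

```python
import random

T, D = 8, 4  # time steps and per-modality feature width (illustrative only)

def encode(signal, t=T, d=D):
    """Stand-in encoder: map any input to a [t, d] feature sequence."""
    rng = random.Random(len(str(signal)))  # deterministic placeholder weights
    return [[rng.random() for _ in range(d)] for _ in range(t)]

def concat_time_aligned(*streams):
    """Concatenate per-modality features along the feature axis at each time step."""
    return [sum((s[t] for s in streams), []) for t in range(len(streams[0]))]

mixture = "waveform"             # audio mixture
text    = "dog barking"          # natural-language prompt
spans   = [(1.0, 2.5)]           # time anchors (seconds)
visual  = "masked video frames"  # visual prompt (video + object mask)

feats = concat_time_aligned(
    encode(mixture), encode(text), encode(spans), encode(visual)
)
assert len(feats) == T and len(feats[0]) == 4 * D  # fused [T, 4*D] representation
```

The point of the sketch is the bookkeeping: each conditioning signal contributes its own feature stream, and the streams meet in a single time-aligned representation before any attention is applied.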

What SAM Audio does, and what ‘segment’ means here

    SAM Audio takes an input recording that contains multiple overlapping sources, like speech plus traffic plus music, and separates a target source based on a prompt. In the public inference API, the model produces two outputs: result.target (the isolated sound) and result.residual (everything else).

This target-residual interface maps directly to editor operations. To remove a dog bark from a podcast track, treat the bark as the target and keep only the residual; conversely, to extract a guitar part from a concert clip, keep the target waveform instead. These are the example edits Meta uses to illustrate the model.
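The two edit operations can be made concrete with a minimal sketch. The `separate` function here is a stand-in for the model call, not the actual SAM Audio API: it splits a toy mix whose components we already know, using integer sample values so the arithmetic is exact. The assumption being illustrated is that target plus residual reconstructs the mixture, so "remove X" is just "keep the residual" and "extract X" is "keep the target".

```python
bark     = [5, 0, 5, 0]  # the sound we want to act on (toy integer samples)
ambience = [1, 2, 1, 2]  # everything else in the recording
mix = [b + a for b, a in zip(bark, ambience)]

def separate(mixture, prompt):
    """Stand-in for the model's separate call: returns (target, residual)."""
    # A real call would be driven by `prompt` (e.g. "dog barking");
    # here we return the known components to show the interface only.
    target = bark
    residual = [m - t for m, t in zip(mixture, target)]
    return target, residual

target, residual = separate(mix, "dog barking")

cleaned = residual  # "remove the bark": keep only the residual
stem    = target    # "extract the bark": keep only the target

# The target/residual pair partitions the mix:
assert [t + r for t, r in zip(target, residual)] == mix
assert cleaned == ambience
```

The same pattern covers the guitar example: prompt for the guitar, keep `target`, discard `residual`.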

The three prompt types Meta is shipping

    Meta positions SAM Audio as a single unified model supporting three prompt types, usable alone or in combination:

  • Text prompting: Describe the sound in natural language, e.g., “dog barking” or “singing voice,” and the model separates that sound from the mixture. Text prompts are a core interaction mode, with an end-to-end example available in the open-source repo using SAMAudioProcessor and model.separate.
  • Visual prompting: Click on a person or object in a video to ask the model to isolate the audio linked to that visual object, implemented by passing video frames and masks into the processor via masked_videos.
  • Span prompting: Mark time segments where the target sound occurs; the model uses those spans to guide separation. This is crucial for ambiguous cases, such as when the same instrument appears multiple times or when a sound is brief, helping to prevent over-separation.
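One plausible way to encode span prompts is as a per-frame binary mask over the mixture's feature frames. The 50 Hz frame rate and the mask representation below are illustrative assumptions, not SAM Audio's actual input format; the sketch only shows how "mark the time segments where the target occurs" becomes model-consumable data.

```python
FRAME_RATE = 50  # assumed feature frames per second (illustrative)

def spans_to_mask(spans, duration_s, frame_rate=FRAME_RATE):
    """Convert (start, end) spans in seconds to a binary per-frame mask."""
    n = round(duration_s * frame_rate)
    mask = [0] * n
    for start, end in spans:
        lo = max(0, round(start * frame_rate))
        hi = min(n, round(end * frame_rate))
        for i in range(lo, hi):
            mask[i] = 1
    return mask

# The target sound occurs twice in a 2-second clip:
mask = spans_to_mask([(0.2, 0.5), (1.4, 1.6)], duration_s=2.0)
assert len(mask) == 100          # 2 s at 50 frames/s
assert sum(mask) == 15 + 10      # frames flagged by the two spans
```

Marking both occurrences, as in the two-span example, is exactly the disambiguation the bullet above describes: the mask tells the model which regions belong to the target without relying on the text prompt alone.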
    Results

The Meta team claims SAM Audio achieves state-of-the-art performance across diverse, real-world scenarios and serves as a unified alternative to single-purpose audio tools. They published a subjective evaluation across seven categories: General, SFX, Speech, Speaker, Music, Instr(wild), and Instr(pro). General scores were 3.62 for sam-audio-small, 3.28 for sam-audio-base, and 3.50 for sam-audio-large, while sam-audio-large reached 4.49 on Instr(pro).

    Key Takeaways

  • SAM Audio is a unified audio separation model that segments sound from complex mixtures using text prompts, visual prompts, and time span prompts.
  • The core API produces two waveforms per request: target for the isolated sound and residual for everything else, easily mapping to common edit operations like removing noise, extracting stems, or keeping ambience.
  • Meta released multiple checkpoints and variants, including sam-audio-small, sam-audio-base, sam-audio-large, plus TV variants that perform better for visual prompting. The repo also includes a subjective evaluation table by category.
  • The release includes tooling beyond inference: Meta provides a sam-audio-judge model that scores separation results against a text description, evaluating overall quality, recall, precision, and faithfulness.