    Google DeepMind Introduces SIMA 2, A Gemini Powered Generalist Agent For Complex 3D Virtual Worlds

    November 17, 2025

    Google DeepMind has released SIMA 2 to test how far generalist embodied agents can go inside complex 3D game worlds. The new version of SIMA (Scalable Instructable Multiworld Agent) upgrades the original instruction follower into a Gemini-driven system that reasons about goals, explains its plans, and improves through self-play across many different environments.

    From SIMA 1 to SIMA 2

    The first SIMA, released in 2024, learned more than 600 language-following skills such as ‘turn left’, ‘climb the ladder’, and ‘open the map’. It controlled commercial games using only rendered pixels and a virtual keyboard and mouse, without any access to game internals. On complex tasks, DeepMind reported a SIMA 1 success rate of about 31 percent, while human players reached about 71 percent on the same benchmark.

    SIMA 2 keeps the same embodied interface but replaces the core policy with a Gemini model. According to a TechCrunch article, the system uses Gemini 2.5 Flash Lite as the reasoning engine. This changes SIMA from a direct mapping between pixels and actions into an agent that forms an internal plan, reasons in language, and then executes the necessary action sequence in the game. DeepMind describes this as moving from an instruction follower to an interactive gaming companion that collaborates with the player.

    https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/

    Architecture: Gemini in the control loop

    The SIMA 2 architecture integrates Gemini as the agent core. The model receives visual observations and user instructions, infers a high level goal, and produces actions that are sent through the virtual keyboard and mouse interface. Training uses a mix of human demonstration videos with language labels and labels generated by Gemini itself. This supervision lets the agent align its internal reasoning with both human intent and model generated descriptions of behavior.
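
    The control loop described above can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Observation`, `Action`, `Step`, `toy_policy`, and `run_episode` are hypothetical, and the rule-based `toy_policy` only stands in for the Gemini core, which is not public code. The sketch shows the data flow only: pixels plus an instruction go in, a language plan plus keyboard and mouse actions come out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    frame: bytes       # rendered pixels for one game frame (placeholder)
    instruction: str   # user instruction in natural language


@dataclass
class Action:
    device: str        # "keyboard" or "mouse"
    command: str       # e.g. "hold W"


@dataclass
class Step:
    plan: str              # language-level plan the agent could report
    actions: List[Action]  # low-level inputs sent to the game


# Hypothetical policy type: any callable from an observation to a step.
Policy = Callable[[Observation], Step]


def toy_policy(obs: Observation) -> Step:
    """Rule-based stand-in for the model; it is NOT how SIMA 2 decides."""
    if "ladder" in obs.instruction.lower():
        return Step(
            plan="Walk to the ladder, then climb it.",
            actions=[Action("keyboard", "hold W"),
                     Action("keyboard", "press SPACE")],
        )
    return Step(plan="No known skill for this instruction; wait.", actions=[])


def run_episode(policy: Policy, frames: List[bytes], instruction: str) -> List[str]:
    """Query the policy once per frame and log the emitted inputs."""
    log: List[str] = []
    for frame in frames:
        step = policy(Observation(frame=frame, instruction=instruction))
        for action in step.actions:
            log.append(f"{action.device}:{action.command}")
    return log


if __name__ == "__main__":
    print(run_episode(toy_policy, [b"", b""], "Climb the ladder"))
```

    The point of the design is that the policy sees only what a human player would see and acts only through the same input devices, so the same loop can be pointed at any game.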

    Because of this training scheme, SIMA 2 can explain what it intends to do and list the steps it will take. In practice, this means the agent can answer questions about its current objective, justify its decisions, and expose an interpretable chain of thought about the environment.


    Generalization and performance

    The task completion plot shows SIMA 1 at about 31% and SIMA 2 at 62% on the main evaluation suite, with humans at around 70%. Integrating Gemini therefore roughly doubles the original agent’s performance on complex tasks. The exact numbers matter less than the overall picture: the new agent closes most of the measured gap between SIMA 1 and human players on long, language-specified missions in the training games.
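
    As a rough check on the "doubles" and "closes most of the gap" claims, the reported figures can be plugged into a short calculation. This is a sketch under one assumption: it uses the roughly 71 percent human figure quoted for the SIMA 1 comparison as the reference, while the plot is described only as showing humans around 70%. The helper name `gap_closed` is hypothetical.

```python
def gap_closed(baseline: float, improved: float, reference: float) -> float:
    """Fraction of the baseline-to-reference gap covered by the improved score."""
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference must be above baseline")
    return (improved - baseline) / gap


# Reported task completion rates in percent; the human value is an assumption
# taken from the SIMA 1 comparison (about 71 percent).
sima1, sima2, human = 31.0, 62.0, 71.0

ratio = sima2 / sima1                   # 62 / 31
share = gap_closed(sima1, sima2, human)  # (62 - 31) / (71 - 31)

if __name__ == "__main__":
    print(f"SIMA 2 / SIMA 1 ratio: {ratio:.2f}")
    print(f"Share of the gap to humans closed: {share:.1%}")
```

    With these inputs the ratio is 2.0 and the share of the gap closed is 31/40, or about 78%, which matches the reading that most, but not all, of the gap is gone.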

    On held-out games such as ASKA and MineDojo, which are never seen during training, the DeepMind team shows a similar pattern. SIMA 2 has much higher task completion than SIMA 1 in these environments, indicating a real gain in zero-shot generalization rather than overfitting to a fixed game set. The agent also transfers abstract concepts; for example, it can reuse an understanding of ‘mining’ in one title when it is asked to ‘harvest’ in another.

    Multimodal instructions

    SIMA 2 extends the instruction channel beyond plain text. The DeepMind demonstrations show the agent following spoken commands, reacting to sketches drawn on the screen, and executing tasks from prompts that use only emojis. In one example, the user asks SIMA 2 to go to ‘the house that is the color of a ripe tomato’. The Gemini core reasons that ripe tomatoes are red, then selects and walks to the red house.

    Gemini also enables instruction following in multiple natural languages and supports mixed prompts where language and visual cues are combined. For physical AI and robotics developers, this is a concrete multimodal stack: a shared representation links text, audio, images, and in-game actions, and the agent uses this representation to ground abstract symbols in concrete control sequences.

    Self improvement at scale

    One of the main research contributions in SIMA 2 is the explicit self-improvement loop. After an initial phase that uses human gameplay as a baseline, the team moves the agent into new games and lets it learn only from its own experience. A separate Gemini model generates new tasks for the agent in each world, and a reward model scores each attempt.

    These trajectories are stored in a bank of self-generated data. Later generations of SIMA 2 use this data during training, which allows the agent to succeed on tasks where earlier generations failed, without any fresh human demonstrations. This is a concrete example of a multitask, model-in-the-loop data engine, where a language model specifies goals and gives feedback, and the agent converts that feedback into new, competent policies.
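
    The shape of this loop (a task generator, a scorer, and a growing experience bank that feeds the next generation) can be sketched in a few lines of Python. This is a toy under loud assumptions, not DeepMind’s training procedure: `propose_tasks`, `attempt`, and `self_improve` are hypothetical names, a single scalar `skill` stands in for the agent’s policy, and "training" is reduced to a simple update from the bank’s mean score.

```python
import random
from typing import List, Tuple

Trajectory = Tuple[str, float]  # (task description, reward score in [0, 1])


def propose_tasks(world: str, n: int) -> List[str]:
    """Stand-in for the task-generating model: returns n placeholder tasks."""
    return [f"{world}: task {i}" for i in range(n)]


def attempt(task: str, skill: float, rng: random.Random) -> float:
    """Stand-in for an agent rollout scored by a reward model."""
    return min(1.0, skill + 0.5 * rng.random())


def self_improve(world: str, generations: int, tasks_per_gen: int,
                 seed: int = 0) -> Tuple[List[Trajectory], List[float]]:
    """Run the toy loop; return the experience bank and skill per generation."""
    rng = random.Random(seed)
    skill = 0.2                      # assumed starting competence
    bank: List[Trajectory] = []      # self-generated experience
    history: List[float] = []
    for _ in range(generations):
        history.append(skill)
        for task in propose_tasks(world, tasks_per_gen):
            bank.append((task, attempt(task, skill, rng)))
        # Toy "training": the next generation is at least as good as the
        # average outcome stored in the bank, and never worse than before.
        mean_score = sum(score for _, score in bank) / len(bank)
        skill = max(skill, mean_score)
    return bank, history


if __name__ == "__main__":
    bank, history = self_improve("new-game", generations=4, tasks_per_gen=5)
    print(len(bank), [round(s, 3) for s in history])
```

    No human demonstrations enter the loop after the starting `skill` is set; all later improvement comes from tasks and scores generated inside the loop, which is the property the article highlights.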

    Genie 3 worlds

    To push generalization further, DeepMind combines SIMA 2 with Genie 3, a world model that generates interactive 3D environments from a single image or text prompt. In these virtual worlds, the agent has to orient itself, parse instructions, and act toward goals even though the geometry and assets differ from all training games.

    The reported behavior is that SIMA 2 can navigate these Genie 3 scenes, identify objects such as benches and trees, and perform requested actions in a coherent way. This is important for researchers; it shows that a single agent can operate across commercial titles and generated environments, using the same reasoning core and control interface.

    Key Takeaways

  • Gemini centered architecture: SIMA 2 integrates Gemini, reported as Gemini 2.5 Flash Lite, as the core reasoning and planning module, wrapped by a visuomotor control stack that acts from pixels through a virtual keyboard and mouse across many commercial games.
  • Measured performance jump over SIMA 1: On DeepMind’s main task suite, SIMA 2 roughly doubles SIMA 1’s 31 percent task completion rate and approaches human level performance in training games, while also delivering significantly higher success rates on held-out environments such as ASKA and MineDojo.
  • Multimodal, compositional instruction following: The agent can follow long, compositional instructions and supports multimodal prompts, including speech, sketches, and emojis, by grounding language and symbols in a shared representation over visual observations and in-game actions.
  • Self improvement via model generated tasks and rewards: SIMA 2 uses a Gemini based teacher to generate tasks and a learned reward model to score trajectories, building a growing experience bank that allows later generations of the agent to outperform earlier ones without additional human demonstrations.
  • Stress testing with Genie 3 and implications for robotics: Coupling SIMA 2 with Genie 3, which synthesizes interactive 3D environments from images or text, shows that the agent can transfer skills to newly generated worlds, supporting DeepMind’s claim that this stack is a concrete step toward general purpose embodied agents and, eventually, more capable real-world robots.

    SIMA 2 is a meaningful systems milestone rather than a simple benchmark win. By placing a Gemini model, reported as Gemini 2.5 Flash Lite, at the core, the DeepMind team demonstrates a practical recipe that joins multimodal perception, language-based planning, and a Gemini-orchestrated self-improvement loop, validated both in commercial games and in Genie 3 generated environments. Overall, SIMA 2 shows how an embodied Gemini stack can act as a realistic precursor for general purpose robotic agents.

    Fintech Fetch Editorial Team