    MIT’s new ‘recursive’ framework lets LLMs process 10 million tokens without context rot

    January 20, 2026

    Recursive language models (RLMs) are an inference technique, developed by researchers at MIT CSAIL, that treats a long prompt as an environment external to the model. Instead of forcing the entire prompt into the model’s context window, the framework allows the LLM to programmatically examine and decompose the text and to recursively call itself over snippets of it.

    Rather than expanding context windows or summarizing old information, the MIT team reframes long-context reasoning as a systems problem. By letting models treat prompts as something they can inspect with code, RLMs allow LLMs to reason over millions of tokens without retraining. This offers enterprises a practical path to long-horizon tasks like codebase analysis, legal review, and multi-step reasoning that routinely break today’s models. Because the framework is designed as a wrapper around existing models, it can serve as a drop-in replacement for applications that make direct calls to LLMs.

    The LLM context problem

    While frontier models are becoming increasingly sophisticated at reasoning, their ability to process massive amounts of information is not scaling at the same rate. This bottleneck is driven by two distinct limitations: the hard limit on how much text a model can process at once (context length) and “context rot,” the tendency of output quality to degrade as the context grows longer.

    The challenge, the researchers argue, is whether it’s possible to scale the effective context size of general-purpose LLMs by orders of magnitude without retraining them. This capability is becoming increasingly important for enterprise applications, where LLMs are adopted for long-horizon tasks requiring the processing of millions of tokens. Alex Zhang, a co-author of the paper, argues that this challenge can’t be solved by simply expanding context windows. “There is an entropy argument that implies you need exponentially more data samples as you increase the effective context window size,” Zhang told VentureBeat.

    Current approaches to extending context often rely on compaction, where the model summarizes older parts of the conversation to free up space. However, this method fails for tasks requiring random access to specific details located in earlier parts of the prompt, because details dropped from a summary cannot be recovered later.

    How RLMs work

    The concept behind RLMs is drawn from “out-of-core” algorithms used in classical computing. These algorithms are designed to process datasets too large to fit into a computer’s main memory by keeping the data on a hard drive and fetching only the necessary chunks as needed.

    RLMs apply this logic to generative AI. Instead of feeding a long prompt directly into the neural network, the framework loads the text as a string variable inside a Python coding environment. The LLM is given general context about the data (such as the total character count) but does not “see” the text initially.

    Once the prompt is stored as a variable, the LLM acts as a programmer. It writes Python code to interact with the external variable, using standard commands to peek into the data. For example, the model might use regular expressions to search for specific keywords like “Chapter 1” or “financial results.” When the code execution finds a relevant snippet, the RLM pulls only that specific chunk into its active context window for analysis. If the prompt is a massive book, the LLM might instead write a loop that identifies chapter boundaries and then triggers a sub-call to summarize each chapter individually.
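The chapter-by-chapter pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function names `peek`, `split_chapters`, and `summarize_book` are hypothetical, the chapter-heading pattern is an assumption about how the text is laid out, and `sub_call` stands in for whatever function actually sends a snippet to a model.

```python
import re
from typing import Callable, List


def peek(prompt: str, keyword: str, width: int = 40) -> List[str]:
    """Return short snippets around each case-insensitive keyword match."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [prompt[max(0, m.start() - width): m.end() + width]
            for m in pattern.finditer(prompt)]


def split_chapters(prompt: str) -> List[str]:
    """Split the stored prompt at lines that begin with 'Chapter <number>'.

    Assumption: chapters are marked this way. Text before the first
    heading is ignored in this sketch.
    """
    starts = [m.start() for m in re.finditer(r"(?m)^Chapter \d+", prompt)]
    if not starts:
        return [prompt]
    bounds = starts + [len(prompt)]
    return [prompt[a:b] for a, b in zip(bounds, bounds[1:])]


def summarize_book(prompt: str, sub_call: Callable[[str], str]) -> List[str]:
    """Send one chapter at a time to a sub-call, never the whole book."""
    return [sub_call("Summarize this chapter:\n" + chapter)
            for chapter in split_chapters(prompt)]
```

Only the snippets returned by `peek` or the single chapter passed to `sub_call` would ever enter a model's context; the full string stays in the Python environment.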

    RLM architecture

    The architecture typically involves two agents. A “root language model,” often a capability-heavy model like GPT-5, acts as the orchestrator. It plans the approach, writes the code, and manages the data flow within the Python REPL (read-eval-print loop) environment. A “recursive language model,” often a faster and cheaper model, acts as the worker. The root LM calls this worker to process the specific text snippets isolated by the code.

    Because the prompt resides in the environment’s memory rather than the model’s context window, the system can handle inputs far larger than the model’s training limit. Importantly, to the end user, the RLM behaves exactly like a standard model: It accepts a string and returns an answer. This allows enterprise teams to swap standard API calls for RLMs.

    For developers looking to experiment, the RLM code is currently available on GitHub. “A key argument for RLMs is that most complex tasks can be decomposed into smaller, ‘local’ sub-tasks,” Zhang said. “However, how to perform this context/problem decomposition is non-trivial, and the model must be capable of performing this.”
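To make the root/worker split concrete, here is a rough sketch of how such a wrapper might look. It is not the API of the published GitHub code: the class name `RecursiveLM`, the environment names `context`, `llm_query`, and `answer`, and the helper `scripted_root` are all hypothetical, and both models are passed in as plain callables so the example runs without any network access.

```python
import contextlib
import io
from typing import Callable, Iterable


class RecursiveLM:
    """Hypothetical wrapper: takes a prompt string, returns an answer string."""

    def __init__(self, root_lm: Callable[[str], str],
                 worker_lm: Callable[[str], str], max_steps: int = 8):
        self.root_lm = root_lm      # orchestrator: sees metadata, writes code
        self.worker_lm = worker_lm  # worker: processes isolated snippets
        self.max_steps = max_steps

    def __call__(self, prompt: str) -> str:
        # The prompt lives in the environment, not in the root model's context.
        env = {"context": prompt, "llm_query": self.worker_lm}
        transcript = (
            f"`context` holds {len(prompt)} characters. Write Python that uses "
            "`context` and `llm_query(text)`; assign `answer` when finished."
        )
        for _ in range(self.max_steps):
            code = self.root_lm(transcript)
            out = io.StringIO()
            with contextlib.redirect_stdout(out):
                exec(code, env)  # NOTE: run model-written code only in a sandbox
            if "answer" in env:
                return str(env["answer"])
            # Feed back a truncated view of the output, never the full prompt.
            transcript += f"\n>>> {code}\n{out.getvalue()[:500]}"
        raise RuntimeError("root model did not produce an answer in time")


def scripted_root(steps: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a real root model: replays fixed code steps (assumption)."""
    it = iter(steps)
    return lambda transcript: next(it)
```

With a scripted root that first prints `len(context)` and then assigns `answer = llm_query(...)` on a matching line, calling `RecursiveLM(root, worker)(prompt)` returns the worker's output for that line. From the caller's side it is still one string in and one string out, which is what makes the drop-in substitution described above plausible.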

    RLMs in action

    To validate the framework, the researchers tested RLMs against base models and other agentic approaches like CodeAct and summary agents across a variety of long-context tasks, including retrieval and multi-hop question answering.

    The results demonstrated strong performance gains at the 10 million+ token scale. On BrowseComp-Plus, a benchmark involving inputs of 6 to 11 million tokens, standard base models failed completely, scoring 0%. In contrast, the RLM powered by GPT-5 achieved a score of 91.33%, significantly outperforming the Summary Agent (70.47%) and CodeAct (51%).

    The framework also excelled at tasks with high computational complexity. On OOLONG-Pairs, an information-dense reasoning benchmark where the difficulty scales quadratically with input length, base GPT-5 models failed catastrophically with a score of just 0.04%. The RLM achieved an F1 score (a balanced measure of precision and recall) of 58%, demonstrating emergent capabilities to handle dense tasks that paralyze standard models. Similarly, on code understanding tasks (CodeQA benchmark), the RLM more than doubled the performance of the base GPT-5 model, jumping from 24% to 62%.
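For readers unfamiliar with the metric, the sketch below shows how an F1 score over sets of item pairs can be computed, and why a pairwise task grows quadratically with the number of items. It is a generic illustration: the function names are hypothetical, and the actual scoring procedure used by OOLONG-Pairs may differ.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]


def pair_f1(predicted: Iterable[Pair], gold: Iterable[Pair]) -> float:
    """Harmonic mean of precision and recall over two sets of pairs."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def num_pairs(n_items: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n_items * (n_items - 1) // 2
```

Doubling the number of items from 1,000 to 2,000 raises the number of candidate pairs from 499,500 to 1,999,000, roughly a fourfold increase, which is the sense in which such a task scales quadratically.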

    The RLM maintains its performance even after the input exceeds the context window limit of the underlying model. Regarding the context rot problem, the data showed that while the base GPT-5 performance degrades rapidly as task complexity increases, RLM performance holds steady, consistently outperforming the base model on contexts longer than 16,000 tokens.

    Despite the increased complexity of the workflow, RLMs often maintained comparable or lower average costs than the baselines. On the BrowseComp-Plus benchmark, the RLM was up to three times cheaper than the summarization baseline.

    However, the researchers noted that while median costs are low, RLM trajectories are “long-tailed.” Outlier runs can become expensive if the model gets stuck in loops or performs redundant verifications. While GPT-5 was conservative in its sub-calls, the open-source Qwen3-Coder model sometimes attempted thousands of sub-calls for simple tasks.

    “Today, you likely will have to implement your own guardrails and logic to control RLM behavior,” Zhang said. However, he hypothesizes that future models could be trained to manage their own compute budgets more effectively. Companies like Prime Intellect are planning to integrate RLM into the training process of models, possibly addressing the edge cases where the model’s inference budget spikes.
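One simple form such a guardrail could take is a hard budget on sub-calls, combined with a cache so that repeated checks of the same snippet cost nothing. The sketch below is an assumption about what teams might build, not something described by the researchers; the names `BudgetedWorker` and `BudgetExceeded` and the default limits are hypothetical.

```python
from typing import Callable, Dict


class BudgetExceeded(RuntimeError):
    """Raised when a run tries to spend more than its sub-call budget."""


class BudgetedWorker:
    """Wraps a worker callable with call-count and character limits."""

    def __init__(self, worker_lm: Callable[[str], str],
                 max_calls: int = 50, max_chars: int = 200_000):
        self.worker_lm = worker_lm
        self.max_calls = max_calls
        self.max_chars = max_chars
        self.calls = 0
        self.chars = 0
        self.cache: Dict[str, str] = {}

    def __call__(self, snippet: str) -> str:
        # Redundant verifications of the same snippet are served from cache.
        if snippet in self.cache:
            return self.cache[snippet]
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"more than {self.max_calls} sub-calls")
        if self.chars + len(snippet) > self.max_chars:
            raise BudgetExceeded(f"more than {self.max_chars} characters sent")
        self.calls += 1
        self.chars += len(snippet)
        result = self.worker_lm(snippet)
        self.cache[snippet] = result
        return result
```

Because the wrapper has the same call signature as the worker it wraps, it can be handed to the orchestrating code in place of the raw worker, and an outlier run then fails fast with `BudgetExceeded` instead of issuing thousands of sub-calls.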

    For enterprise architects deciding where to place their bets, the RLM framework offers a new tool for handling information-dense problems. “I think RLMs are still extremely useful for chatbots (think long chat histories), but ultimately they argue for an alternative way of using LMs,” Zhang said. “I think RLMs work in tandem with standard retrieval methods like RAG; they do not serve as a replacement, and can be used in different settings or together.”

    Fintech Fetch Editorial Team