Make Your LLM Fully Utilize the Context

Abstract

While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, a problem known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. By applying this information-intensive training to Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B to utilize long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). GitHub link: https://github.com/microsoft/FILM.
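To make the data construction idea concrete, the following is a minimal sketch of how a single-segment training context might be assembled: a short key segment is placed at a random position among filler segments until the context approaches a target length. It is an illustration under stated assumptions, not the authors' pipeline. The function name `build_long_context`, the use of whitespace splitting as a stand-in for a real tokenizer, and the toy segments are all hypothetical, and the generation of the question-answer pair itself is omitted.

```python
import random


def build_long_context(key_segment, filler_segments, target_tokens, rng):
    """Hypothetical helper: embed key_segment at a random position among fillers.

    Token counts are approximated by whitespace splitting (an assumption;
    a real setup would use the model's tokenizer).
    Returns the joined context and the index of the key segment.
    """
    pool = list(filler_segments)
    rng.shuffle(pool)

    chosen = []
    total = len(key_segment.split())
    for seg in pool:
        n = len(seg.split())
        if total + n > target_tokens:
            break  # stop once the next filler would exceed the budget
        chosen.append(seg)
        total += n

    # Any position is equally likely, so the key segment can land at the
    # start, the middle, or the end of the long context.
    insert_at = rng.randint(0, len(chosen))
    segments = chosen[:insert_at] + [key_segment] + chosen[insert_at:]
    return "\n".join(segments), insert_at


# Toy usage: a ~128-token key segment inside a ~4K-token context.
rng = random.Random(0)
key = " ".join(["KEY"] * 128)
fillers = [" ".join([f"filler{i}"] * 128) for i in range(100)]
context, key_index = build_long_context(key, fillers, 4096, rng)
```

Sampling the insertion index uniformly reflects the abstract's premise that any position in the context can hold the crucial information. The multi-segment case (two or more key segments) would need the same placement applied to each segment.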
