We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn from a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline.
Task Haystack draws inspiration from the widely adopted "needle-in-a-haystack" (NIAH) evaluation, but presents new and unique challenges. It demands that models (1) utilize the context with deeper understanding, rather than resorting to simple copying and pasting, and (2) navigate through long streams of evolving topics and tasks, which more closely approximates the complexities of real-world usage of long-context LMs. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively.
We benchmark 12 long-context LMs using Task Haystack. We find that state-of-the-art closed models such as GPT-4o still struggle in this setting, failing 15% of the cases on average, while all open-weight models we evaluate lag further behind by a large margin, failing up to 61% of the cases. In our controlled analysis, we identify factors such as distraction and recency bias as contributors to these failure cases. Further, we observe declines in performance when task instructions are paraphrased at test time or when ICL demonstrations are repeated excessively, raising concerns about the robustness, instruction understanding, and true context utilization of current long-context LMs.
Lifelong ICL presents long-context LMs with a sequence of tasks, each consisting of a task instruction and a few demonstrations. At test time, the model is given the instruction of a previously seen task and then makes predictions on the test input directly. A long-context LM "passes" the Task Haystack test when its accuracy under Lifelong ICL (Task 1+2+3) is not significantly worse than that of the Single-task ICL baseline (Task 2 only).
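For concreteness, below is a minimal sketch of how a Lifelong ICL prompt could be assembled from a task stream. The `Task` dataclass, the `Input:`/`Output:` template, and the function name are illustrative assumptions, not the exact format used by Task Haystack.

```python
from dataclasses import dataclass


@dataclass
class Task:
    instruction: str                 # natural-language task instruction
    demos: list[tuple[str, str]]     # (input, label) ICL demonstrations


def build_lifelong_icl_prompt(task_stream: list[Task], test_task: Task, test_input: str) -> str:
    """Concatenate every task's instruction and demonstrations, then append
    the (seen) test task's instruction followed by the test input only."""
    blocks = []
    for task in task_stream:  # Task 1, Task 2, Task 3, ...
        lines = [task.instruction]
        lines += [f"Input: {x}\nOutput: {y}" for x, y in task.demos]
        blocks.append("\n\n".join(lines))
    # At test time, only the instruction of the test task is repeated;
    # the model must locate the relevant demonstrations within the context.
    blocks.append(f"{test_task.instruction}\n\nInput: {test_input}\nOutput:")
    return "\n\n".join(blocks)
```

Under this sketch, the Single-task ICL baseline corresponds to calling the same function with a task stream that contains only the test task.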
We evaluate 10 open-weight models and 2 closed models, visualizing Lifelong ICL accuracy and pass rate as a function of Single-task ICL accuracy. Even the state-of-the-art closed model GPT-4o still struggles in this setting, and all open-weight models we evaluate lag further behind by a large margin.
All few-shot settings use a 16-task stream; parentheses in the column headers give the approximate Lifelong ICL prompt length, and parentheses after model names give the model's context window. s-acc = Single-task ICL accuracy (%), l-acc = Lifelong ICL accuracy (%), pass = Task Haystack pass rate (%).

Model | 0-shot s-acc | 1-shot (4k) s-acc | 1-shot (4k) l-acc | 1-shot (4k) pass | 2-shot (8k) s-acc | 2-shot (8k) l-acc | 2-shot (8k) pass | 4-shot (16k) s-acc | 4-shot (16k) l-acc | 4-shot (16k) pass | 8-shot (32k) s-acc | 8-shot (32k) l-acc | 8-shot (32k) pass
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Mistral-7B (32k) | 68.1 | 73.9 | 74.6 | 91.2 | 77.6 | 74.6 | 73.8 | 78.6 | 74.8 | 67.5 | 80.3 | 74.2 | 47.5 |
FILM-7B (32k) | 71.1 | 76.7 | 74.7 | 77.5 | 79.1 | 75.1 | 77.5 | 79.6 | 75.4 | 72.5 | 80.8 | 74.9 | 55.0 |
Llama2-7B (32k) | 61.9 | 69.8 | 63.3 | 77.5 | 72.8 | 64.5 | 53.8 | 75.6 | 63.0 | 41.2 | 78.0 | - | - |
Llama2-7B (80k) | 38.4 | 47.6 | 60.0 | 100.0 | 49.8 | 60.2 | 100.0 | 56.3 | 62.3 | 96.3 | 59.8 | 61.5 | 76.3 |
Llama3-8B (1048k) | 51.2 | 65.5 | 68.1 | 78.8 | 70.0 | 69.1 | 76.2 | 71.5 | 70.1 | 71.3 | 73.6 | 70.1 | 57.5 |
Llama3-70B (1048k) | 60.7 | 79.1 | 72.9 | 68.8 | 79.0 | 74.4 | 50.0 | 80.3 | 75.3 | 57.5 | 81.7 | 75.7 | 51.2 |
Yi-6B (200k) | 51.3 | 70.1 | 57.9 | 61.3 | 73.0 | 58.6 | 51.2 | 75.0 | 58.4 | 43.8 | 75.5 | 57.7 | 38.8 |
Yi-9B (200k) | 57.0 | 74.5 | 71.5 | 71.2 | 77.7 | 72.9 | 71.2 | 78.0 | 72.9 | 63.7 | 80.0 | 72.9 | 47.5 |
Yi-34B (200k) | 63.1 | 74.1 | 71.7 | 62.5 | 74.1 | 72.4 | 60.0 | 76.1 | 72.9 | 63.8 | 78.2 | 72.6 | 53.8 |
Cmd-R-35B (128k) | 65.6 | 73.0 | 74.6 | 81.2 | 75.3 | 75.5 | 61.3 | 78.9 | 68.7 | 58.8 | 80.5 | 75.3 | 41.2 |
GPT-3.5-Turbo (16k) | 78.3 | 81.6 | 76.3 | 73.8 | 82.6 | 79.6 | 71.3 | 83.2 | 79.5 | 62.5 | 81.8 | - | - |
GPT-4o (128k) | 70.7 | 85.8 | 87.4 | 86.3 | 87.0 | 87.8 | 81.3 | 87.0 | 88.4 | 83.8 | 87.5 | 89.1 | 88.8 |
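The pass rates above reflect the criterion that Lifelong ICL accuracy must not be significantly worse than Single-task ICL accuracy. The sketch below shows one way such a check could be implemented, using a one-sided paired sign-flip permutation test over per-example correctness; this particular test, and the function names, are assumptions and may differ from the statistical procedure actually used in Task Haystack.

```python
import random


def passes_task_haystack(single_correct: list[bool], lifelong_correct: list[bool],
                         n_perm: int = 10_000, alpha: float = 0.05, seed: int = 0) -> bool:
    """Return True ("pass") unless the accuracy drop from Single-task ICL to
    Lifelong ICL is significant under a one-sided paired sign-flip permutation
    test (an illustrative choice of significance test)."""
    rng = random.Random(seed)
    # Paired differences: +1 if only Single-task ICL is correct,
    # -1 if only Lifelong ICL is correct, 0 if the two settings agree.
    diffs = [int(s) - int(l) for s, l in zip(single_correct, lifelong_correct)]
    observed_drop = sum(diffs)
    # Under the null of no difference, each pair's sign is exchangeable.
    extreme = sum(
        sum(d if rng.random() < 0.5 else -d for d in diffs) >= observed_drop
        for _ in range(n_perm)
    )
    p_value = extreme / n_perm
    return p_value >= alpha  # fail only when the drop is statistically significant


def pass_rate(results: list[bool]) -> float:
    """Aggregate pass/fail outcomes (e.g., over tasks and runs) into a percentage."""
    return 100.0 * sum(results) / len(results)
```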
To understand the causes behind the failure cases in Task Haystack, we conduct controlled experiments that isolate factors such as recency bias (models favoring information at the end of the context) and distractibility (models getting distracted by irrelevant information). The results confirm that both factors contribute to the performance degradation on Task Haystack. Additionally, model performance drops when instructions are paraphrased at test time and when the few-shot ICL demonstrations of a single task are repeated multiple times. These observations highlight the limitations of current long-context models in terms of robustness, language understanding, and context utilization. The controlled settings are summarized in the table below, with a sketch of the corresponding prompt construction after the table.
The three controlled factors indicate whether a setting involves a long context (Long Ctx.), distraction from other tasks or irrelevant text (Distraction), and placement of the relevant demonstrations near the end of the context (Recency). 🔀 marks a shuffled copy of the demonstrations; 🔁 marks a paraphrased test-time instruction.

Setting | Input Prompt Example | Long Ctx. | Distraction | Recency
---|---|---|---|---
Baseline (Single-task ICL) | T1 Train → T1 Test | ✖️ | ✖️ | ✔
Random | Random Text → T1 Train → T1 Test | ✔ | ✔ | ✔
Repeat | T1 Train → T1 Train → T1 Train → T1 Test | ✔ | ✖️ | ✔
Repeat+Shuffle | T1 Train → 🔀 T1 Train → 🔀 T1 Train → T1 Test | ✔ | ✖️ | ✔
Recall (Lifelong ICL) | T1 Train → T2 Train → T3 Train → T1 Test | ✔ | ✔ | ✖️
Replay | T1 Train → T2 Train → T3 Train → T1 Train → T1 Test | ✔ | ✔ | ✔
Remove | T2 Train → T3 Train → T1 Test | ✔ | ✔ | N/A
Paraphrase | T1 Train → T2 Train → T3 Train → 🔁 T1 Test | ✔ | ✔ | ✖️
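The sketch below illustrates how the prompt variants in the table above could be composed from per-task training blocks. The function names, block arguments, and the paraphrased-test placeholder are illustrative assumptions rather than the exact Task Haystack templates.

```python
import random


def shuffled(block: str) -> str:
    """Return a copy of a training block with its demonstrations (lines) in shuffled order (🔀)."""
    lines = block.splitlines()
    random.shuffle(lines)
    return "\n".join(lines)


def controlled_prompt(setting: str, t1_train: str, t2_train: str, t3_train: str,
                      t1_test: str, random_text: str = "", t1_test_paraphrased: str = "") -> str:
    """Compose the input prompt for one controlled setting from per-task blocks."""
    settings = {
        "Baseline":       [t1_train, t1_test],
        "Random":         [random_text, t1_train, t1_test],
        "Repeat":         [t1_train, t1_train, t1_train, t1_test],
        "Repeat+Shuffle": [t1_train, shuffled(t1_train), shuffled(t1_train), t1_test],
        "Recall":         [t1_train, t2_train, t3_train, t1_test],
        "Replay":         [t1_train, t2_train, t3_train, t1_train, t1_test],
        "Remove":         [t2_train, t3_train, t1_test],
        "Paraphrase":     [t1_train, t2_train, t3_train, t1_test_paraphrased],  # 🔁 paraphrased instruction
    }
    return "\n\n".join(settings[setting])
```

In this sketch, Baseline reproduces Single-task ICL and Recall reproduces Lifelong ICL, while the remaining settings toggle the long-context, distraction, and recency factors marked in the table.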