Stress-Testing Long-Context Language Models
with Lifelong ICL and Task Haystack

1Tsinghua University, 2University of Southern California
*Equal Contribution

Abstract

We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn from a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline.

Task Haystack draws inspiration from the widely adopted “needle-in-a-haystack” (NIAH) evaluation, but presents new and unique challenges. It demands that models (1) utilize the context with deeper understanding, rather than resorting to simple copying and pasting, and (2) navigate long streams of evolving topics and tasks, which closely approximates the complexities of real-world usage of long-context LMs. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively.

We benchmark 12 long-context LMs using Task Haystack. We find that state-of-the-art closed models such as GPT-4o still struggle in this setting, failing 15% of the cases on average, while all open-weight models we evaluate lag behind by a large margin, failing up to 61% of the cases. In our controlled analysis, we identify factors such as distraction and recency bias as contributors to these failure cases. Further, we observe declines in performance when task instructions are paraphrased at test time or when ICL demonstrations are repeated excessively, raising concerns about the robustness, instruction understanding, and true context utilization of current long-context LMs.

Lifelong ICL and Task Haystack

Lifelong ICL presents long-context LMs with a sequence of tasks, each containing a task instruction and a few demonstrations. At test time, the model is given the instruction of a previously seen task and then makes predictions on the test input directly. A long-context LM “passes” the Task Haystack test when its accuracies in Lifelong ICL (Task 1+2+3) are not significantly worse than those of the Single-task ICL baseline (Task 2 only).
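
For concreteness, here is a minimal sketch of how the two prompt formats could be assembled. The function names and the Input/Output template are illustrative assumptions, not the official codebase.

```python
# Illustrative sketch (not the official implementation) of the two prompt formats
# compared in Task Haystack: the Lifelong ICL prompt and the Single-task ICL baseline.

def format_task_block(task, n_shot):
    """Serialize one task: its instruction followed by n_shot input/output demonstrations."""
    lines = [task["instruction"]]
    for x, y in task["demos"][:n_shot]:
        lines.append(f"Input: {x}\nOutput: {y}")
    return "\n\n".join(lines)

def build_lifelong_icl_prompt(tasks, test_task, test_input, n_shot):
    """Concatenate every task's block in order, then append the test task's
    instruction and the test input (no demonstrations are repeated at test time)."""
    stream = "\n\n".join(format_task_block(t, n_shot) for t in tasks)
    return f"{stream}\n\n{test_task['instruction']}\nInput: {test_input}\nOutput:"

def build_single_task_prompt(test_task, test_input, n_shot):
    """Single-task ICL baseline: only the test task's own instruction and demonstrations."""
    return (f"{format_task_block(test_task, n_shot)}\n\n"
            f"{test_task['instruction']}\nInput: {test_input}\nOutput:")
```

A model passes a Task Haystack cell when its accuracy with the first prompt is not significantly worse than its accuracy with the second.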

Result Summary

Main Results

We evaluate 10 open-weight models and 2 closed models. We visualize Lifelong ICL accuracy and pass rate as functions of single-task ICL accuracy. State-of-the-art closed models (GPT-4o) still struggle in this setting, and all open models we evaluate lag behind by a large margin.
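
The summary visualization can be reproduced from the full result table below with a short script along the following lines; the plotting choices and file name are our own, not part of the release.

```python
# Sketch: scatter Lifelong ICL accuracy against Single-task ICL accuracy.
# Points below the dashed y = x line indicate degradation under Lifelong ICL.
import matplotlib.pyplot as plt

def plot_summary(results, out_path="summary_scatter.png"):
    """results: dict mapping model name -> (single_task_acc, lifelong_acc), in percent."""
    fig, ax = plt.subplots(figsize=(5, 5))
    for name, (s_acc, l_acc) in results.items():
        ax.scatter(s_acc, l_acc)
        ax.annotate(name, (s_acc, l_acc), fontsize=7)
    ax.plot([50, 95], [50, 95], linestyle="--", color="gray", label="no degradation (y = x)")
    ax.set_xlabel("Single-task ICL accuracy (%)")
    ax.set_ylabel("Lifelong ICL accuracy (%)")
    ax.legend()
    fig.tight_layout()
    fig.savefig(out_path, dpi=200)
```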

Full Result Table (Scale-Shot Setting, 16 tasks)

Model                 | 0-shot | 1-shot (4k)        | 2-shot (8k)        | 4-shot (16k)       | 8-shot (32k)
                      | s-acc  | s-acc  l-acc  pass | s-acc  l-acc  pass | s-acc  l-acc  pass | s-acc  l-acc  pass
Mistral-7B (32k)      | 68.1   | 73.9   74.6   91.2 | 77.6   74.6   73.8 | 78.6   74.8   67.5 | 80.3   74.2   47.5
FILM-7B (32k)         | 71.1   | 76.7   74.7   77.5 | 79.1   75.1   77.5 | 79.6   75.4   72.5 | 80.8   74.9   55.0
Llama2-7B (32k)       | 61.9   | 69.8   63.3   77.5 | 72.8   64.5   53.8 | 75.6   63.0   41.2 | 78.0   -      -
Llama2-7B (80k)       | 38.4   | 47.6   60.0  100.0 | 49.8   60.2  100.0 | 56.3   62.3   96.3 | 59.8   61.5   76.3
Llama3-8B (1048k)     | 51.2   | 65.5   68.1   78.8 | 70.0   69.1   76.2 | 71.5   70.1   71.3 | 73.6   70.1   57.5
Llama3-70B (1048k)    | 60.7   | 79.1   72.9   68.8 | 79.0   74.4   50.0 | 80.3   75.3   57.5 | 81.7   75.7   51.2
Yi-6B (200k)          | 51.3   | 70.1   57.9   61.3 | 73.0   58.6   51.2 | 75.0   58.4   43.8 | 75.5   57.7   38.8
Yi-9B (200k)          | 57.0   | 74.5   71.5   71.2 | 77.7   72.9   71.2 | 78.0   72.9   63.7 | 80.0   72.9   47.5
Yi-34B (200k)         | 63.1   | 74.1   71.7   62.5 | 74.1   72.4   60.0 | 76.1   72.9   63.8 | 78.2   72.6   53.8
Cmd-R-35B (128k)      | 65.6   | 73.0   74.6   81.2 | 75.3   75.5   61.3 | 78.9   68.7   58.8 | 80.5   75.3   41.2
GPT-3.5-Turbo (16k)   | 78.3   | 81.6   76.3   73.8 | 82.6   79.6   71.3 | 83.2   79.5   62.5 | 81.8   -      -
GPT-4o (128k)         | 70.7   | 85.8   87.4   86.3 | 87.0   87.8   81.3 | 87.0   88.4   83.8 | 87.5   89.1   88.8

s-acc: Single-task ICL accuracy (%). l-acc: Lifelong ICL accuracy (%). pass: pass rate (%). "-": not available.
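
The pass columns report, for each model and shot setting, the fraction of evaluation cells where Lifelong ICL accuracy is not significantly worse than Single-task ICL accuracy. The exact statistical test is specified in the paper; the sketch below only illustrates the idea with a simple bootstrap comparison, which is our own choice.

```python
# Illustrative pass-rate computation (the paper's exact significance test may differ).
import numpy as np

def cell_passes(single_correct, lifelong_correct, n_boot=10_000, alpha=0.05, seed=0):
    """single_correct / lifelong_correct: 0/1 arrays of per-example correctness on the
    same test set. The cell fails only if the accuracy gap (single - lifelong) is
    significantly positive, i.e., its one-sided lower bound stays above zero."""
    rng = np.random.default_rng(seed)
    single = np.asarray(single_correct, dtype=float)
    lifelong = np.asarray(lifelong_correct, dtype=float)
    idx = rng.integers(0, len(single), size=(n_boot, len(single)))
    gaps = single[idx].mean(axis=1) - lifelong[idx].mean(axis=1)
    return np.quantile(gaps, alpha) <= 0.0

def pass_rate(cells):
    """cells: list of (single_correct, lifelong_correct) pairs, one per evaluation cell."""
    return 100.0 * np.mean([cell_passes(s, l) for s, l in cells])
```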

Needle-in-a-haystack-style Visualization


Task Haystack Results of Mistral-7B (32K).


Detailed Report of Mistral-7B (32k), N-task=16, N-shot=8.
Detailed Report of FILM-7B (32k), N-task=16, N-shot=8.
Detailed Report of GPT-3.5-Turbo (16k), N-task=16, N-shot=4.
Detailed Report of GPT-4o (128k), N-task=16, N-shot=8.
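
A grid like the ones in these reports can be rendered with a few lines of matplotlib. The sketch below assumes you have already computed a matrix of accuracy differences; the axis labels, color scale, and function name are illustrative choices, not the paper's plotting code.

```python
# Sketch of an NIAH-style grid: each cell is colored by the accuracy difference between
# Lifelong ICL and Single-task ICL for one configuration (e.g., task vs. position in the stream).
import numpy as np
import matplotlib.pyplot as plt

def plot_task_haystack_grid(acc_diff, row_labels, col_labels, out_path="grid.png"):
    """acc_diff: 2-D array of (Lifelong ICL acc. - Single-task ICL acc.), in percent."""
    fig, ax = plt.subplots(figsize=(8, 4))
    im = ax.imshow(np.asarray(acc_diff), cmap="RdYlGn", vmin=-20, vmax=20, aspect="auto")
    ax.set_yticks(range(len(row_labels)))
    ax.set_yticklabels(row_labels)
    ax.set_xticks(range(len(col_labels)))
    ax.set_xticklabels(col_labels, rotation=45, ha="right")
    fig.colorbar(im, ax=ax, label="Lifelong ICL acc. - Single-task ICL acc. (%)")
    fig.tight_layout()
    fig.savefig(out_path, dpi=200)
```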

Controlled Experiments

To understand the causes behind the failure cases in Task Haystack, we conduct controlled experiments that isolate factors such as recency bias (models favoring information at the end of the context) and distractibility (models getting distracted by irrelevant information). The results confirm that both factors contribute to the performance degradation on Task Haystack. Additionally, model performance drops when instructions are paraphrased at test time and when few-shot ICL demonstrations of a single task are repeated multiple times. These observations highlight the limitations of current long-context models in terms of their robustness, language understanding, and context utilization.

Setting                    | Input Prompt Example                             | Controlled Factors
                           |                                                  | Long Ctx. | Distraction | Recency
Baseline (Single-task ICL) | T1 Train, T1 Test                                | ✖️         | ✖️           |
Random                     | Random Text, T1 Train, T1 Test                   |           |             |
Repeat                     | T1 Train, T1 Train, T1 Train, T1 Test            |           | ✖️           |
Repeat+Shuffle             | T1 Train 🔀, T1 Train 🔀, T1 Train, T1 Test        |           | ✖️           |
Recall (Lifelong ICL)      | T1 Train, T2 Train, T3 Train, T1 Test            |           |             | ✖️
Replay                     | T1 Train, T2 Train, T3 Train, T1 Train, T1 Test  |           |             |
Remove                     | T2 Train, T3 Train, T1 Test                      |           |             | N/A
Paraphrase                 | T1 Train, T2 Train, T3 Train, 🔁 T1 Test          |           |             | ✖️
✖️: the controlled factor is absent in this setting. 🔀: shuffle the order of few-shot examples. 🔁: use a paraphrased task instruction at test time.
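
As a rough sketch, the controlled prompts above could be assembled as follows. The helper names, the Input/Output template, and the paraphrased_instruction field are illustrative assumptions on our part, not the released implementation.

```python
# Illustrative construction of the controlled-setting prompts from the table above.
import random

def serialize(task, shuffle=False, seed=0):
    """Turn one task (instruction + demos) into a text block, optionally shuffling the demos."""
    demos = list(task["demos"])
    if shuffle:
        random.Random(seed).shuffle(demos)
    body = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return f"{task['instruction']}\n\n{body}"

def controlled_prompt(setting, t1, distractors, test_input, random_text="", n_repeat=3):
    """Build the context for one row of the table, then append the T1 test query.
    `t1` is the target task; `distractors` are the other tasks (e.g., T2, T3)."""
    if setting == "Baseline":
        ctx = [serialize(t1)]
    elif setting == "Random":
        ctx = [random_text, serialize(t1)]
    elif setting == "Repeat":
        ctx = [serialize(t1)] * n_repeat
    elif setting == "Repeat+Shuffle":
        ctx = [serialize(t1, shuffle=True, seed=i) for i in range(n_repeat - 1)] + [serialize(t1)]
    elif setting in ("Recall", "Paraphrase"):   # the Lifelong ICL stream: T1, T2, T3, ...
        ctx = [serialize(t1)] + [serialize(t) for t in distractors]
    elif setting == "Replay":                   # replay T1 demonstrations right before the test
        ctx = [serialize(t1)] + [serialize(t) for t in distractors] + [serialize(t1)]
    elif setting == "Remove":                   # T1 demonstrations are dropped entirely
        ctx = [serialize(t) for t in distractors]
    else:
        raise ValueError(f"unknown setting: {setting}")
    instruction = (t1["paraphrased_instruction"] if setting == "Paraphrase"
                   else t1["instruction"])
    return "\n\n".join(ctx) + f"\n\n{instruction}\nInput: {test_input}\nOutput:"
```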
Left: Controlled Experiments Results on FILM-7B (32k) and Mistral-7B (32k); Right: Repeated ICL as "Multi-epoch" ICL.