Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Contents
- Prerequisite
- Introduction
- Experimental Setup
- Ground Truth Matters Little
- Why does In-Context Learning work?
- Conclusion
Prerequisite
In-context learning
In-context learning (ICL) means that the model understands the contextual meaning within a prompt and generates responses based on it. In other words, ICL does not update the model's weights, as pretraining or fine-tuning do, and it requires no separate training process. Because ICL mirrors the human process of reasoning from examples, it offers a more intuitive approach to problem-solving.

ICL can be divided into zero-shot, one-shot, and few-shot settings, depending on how many examples are given in the context.
Zero-shot
Zero-shot refers to performing a task without any given examples. For example, asked to translate 'red apple' into French with no demonstrations:
GPT: "pomme rouge"
One-shot
One-shot refers to performing a task with the help of a single example. For example, after one demonstration such as 'red apple' → 'pomme rouge', the prompt asks: then what is the 'green watermelon'?
GPT: "vert watermelon"
Few-shot
Few-shot learning refers to performing a task with the help of multiple examples. Given several such translation demonstrations, the prompt asks: then what is the 'green apple'?
GPT: "vert Apple"
Introduction
In-context learning is a useful method for improving a model's performance without additional training. Despite this improvement, however, little is understood about how it works and which aspects of the demonstrations contribute to end-task performance.
This paper investigates what the model learns from demonstrations and which of their elements affect downstream task performance. The goals and conclusions of this research are as follows:
- Evaluate the importance of ground-truth labels in demonstrations: replacing the labels in demonstrations with random labels barely hurts performance across a range of classification and multi-choice tasks.

- Identify the key aspects of demonstrations that contribute to ICL: the label space, the distribution of the input text, and the format.
An example of random labels: each input is paired with an arbitrary label such as Label: "Negative", regardless of the correct answer.
Experimental Setup
In this study, six language models were evaluated with two inference methods: direct and channel. The direct method feeds the model the input in the usual order and predicts the corresponding output. The channel method reverses this: the model is given the output first and predicts the input.
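Concretely, for a test input $x$ and label set $C$, the two inference methods can be sketched as the following scoring rules (notation matches the formulas in the next section; the channel rule reflects the noisy-channel framing, with a uniform prior over labels assumed):

$$\hat{y}_{\text{direct}} = \arg\max_{y \in C} P(y \mid x) \qquad \hat{y}_{\text{channel}} = \arg\max_{y \in C} P(x \mid y)$$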

The evaluation was conducted on 26 datasets covering a variety of tasks and domains (e.g., classification and multi-choice).
Ground Truth Matters Little
To measure the importance of correctly paired inputs and labels in the demonstrations, the researchers compared three methods (a prompt-construction sketch follows the list):
- No demonstrations: the typical zero-shot method, i.e., $\arg\max_{y \in C} P(y \mid x)$
- Demonstrations w/ gold labels: the typical ICL method (correct input-label mapping), i.e., $\arg\max_{y \in C} P(y \mid x_1, y_1, \dots, x_k, y_k, x)$
- Demonstrations w/ random labels: gold labels from the labeled data are replaced with random labels, i.e., $\arg\max_{y \in C} P(y \mid x_1, \tilde{y}_1, \dots, x_k, \tilde{y}_k, x)$

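Below is a minimal sketch of how these three prompt variants might be constructed. The sentiment data, label names, and Input/Label template are illustrative assumptions, not the paper's exact code:

```python
import random

# Illustrative labeled examples for a sentiment task (hypothetical data).
train_pairs = [
    ("A thrilling ride from start to finish.", "Positive"),
    ("Flat characters and a predictable plot.", "Negative"),
    ("One of the best films of the year.", "Positive"),
    ("I walked out halfway through.", "Negative"),
]
label_set = ["Positive", "Negative"]
test_input = "An unexpectedly moving story."

def build_prompt(pairs, test_x):
    """Format (input, label) demonstration pairs followed by the test input."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in pairs]
    lines.append(f"Input: {test_x}\nLabel:")
    return "\n".join(lines)

# 1. No demonstrations (zero-shot).
prompt_zero = build_prompt([], test_input)

# 2. Demonstrations with gold labels (standard ICL).
prompt_gold = build_prompt(train_pairs, test_input)

# 3. Demonstrations with random labels: keep the inputs,
#    replace each gold label with one sampled uniformly from the label set.
random_pairs = [(x, random.choice(label_set)) for x, _ in train_pairs]
prompt_random = build_prompt(random_pairs, test_input)
```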
The results are reported in the figure above. Without demonstrations, the model's performance decreases significantly. However, replacing gold labels with random labels hurts performance only marginally. This indicates that the model is capable of recovering the expected input-label correspondence for the task, but it does not do so directly from the pairings in the demonstrations.
Why does In-Context Learning work?
The results above show that the gold input-label mapping does not play a crucial role in the demonstrations. Which aspects of the demonstrations, then, improve model performance?
The researchers define three additional aspects of the demonstrations that potentially provide a learning signal:
- The distribution of the input text: the underlying distribution of the inputs $x_1, \dots, x_k$ in the demonstrations
- The label space: the space covered by $y_1, \dots, y_k$
- The format: the use of input-label pairing as the format

Impact of the distribution of the input text
The researchers compared the model's performance with out-of-distribution (OOD) demonstration inputs while keeping the label space and the format fixed. The results show that the model's performance is significantly affected by the distribution of the input text.
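A sketch of this ablation, reusing the illustrative sentiment setup from above (the OOD sentences stand in for text sampled from an external corpus and are assumptions, not the paper's data):

```python
import random

# Labels and the input-label format are kept; only the demonstration inputs change.
label_set = ["Positive", "Negative"]

# Hypothetical out-of-distribution sentences, e.g. drawn from a news corpus
# rather than the movie-review distribution of the task.
ood_corpus = [
    "The central bank raised interest rates by half a point.",
    "Heavy rain is expected across the region this weekend.",
    "The committee approved the budget after a long debate.",
    "Researchers mapped the genome of a deep-sea organism.",
]

# Pair each OOD input with a random label, preserving the pairing format.
ood_pairs = [(x, random.choice(label_set)) for x in ood_corpus]
demos = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in ood_pairs)
```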

This suggests that in-distribution inputs matter because conditioning on in-distribution text makes the task closer to language modeling, where the model was always conditioned on in-distribution text during training.
Impact of the label space
In this experiment, the researchers used random English words as labels for all $k$ pairs. They construct a random subset of English words $C_{\text{rand}}$ with $|C_{\text{rand}}| = |C|$ and randomly pair each $x_i$ with $\tilde{y}_i \in C_{\text{rand}}$.
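A minimal sketch of this label-space ablation (the word pool and inputs are illustrative assumptions):

```python
import random

# Hypothetical pool of unrelated English words to draw labels from.
english_words = ["cloud", "unanimity", "wave", "orbit", "lantern", "meadow"]

inputs = [
    "A thrilling ride from start to finish.",
    "Flat characters and a predictable plot.",
]

# Sample |C_rand| = |C| random words (|C| = 2 for binary sentiment)
# and pair each demonstration input with one of them at random.
c_rand = random.sample(english_words, k=2)
pairs = [(x, random.choice(c_rand)) for x in inputs]
demos = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in pairs)
```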

As the results above show, the channel models exhibit no significant drop, and sometimes even an increase. This is because channel models only condition on the labels and never have to generate them. The direct models, however, exhibit a clear performance gap, indicating that they exploit the label space, i.e., the distribution of labels to choose from for a given input.
Impact of the format
The "format" means the use of input-label pairing. Experiments were conducted by changing the format using methods such as demonstrations with no labels and demonstrations with labels only.

As shown in the figure above, altering the format (indicated by the purple and green bars) led to performance that was nearly the same as, or even lower than, the no-demonstration case. This suggests that the format is essential in guiding the model to replicate the intended inference process.
Impact of the meta-training
MetaICL is minimally affected by random labels; however, changes in format (using labels only, or no labels at all) significantly impact performance. This suggests that meta-training encourages the model to rely on simple elements, such as format, rather than on correctly matched input-label examples in the demonstrations, implying that format is easier for the model to exploit. Additionally, altering the output space (e.g., using random English words) has little effect on Channel MetaICL compared to Direct MetaICL. This implies that it is easier for the model to exploit the space of the text it must generate than the space of the text it merely conditions on.
The example demonstrations used to isolate each aspect of the demonstrations

Conclusion
- Format Matters More Than Labels: The study shows that correct input-label pairings are less critical than maintaining a consistent format in demonstrations. Random labels have minimal impact, but format changes significantly reduce performance, highlighting format as a primary cue for the model.
- MetaICL Leverages Simple Patterns: MetaICL focuses on simple elements like format rather than precise input-label matching, making format an easily exploitable structure for In-Context Learning.
- Exploiting Input Structure Over Correctness: The findings suggest that when optimizing In-Context Learning, designing demonstrations with clear structural hints may be more beneficial than ensuring perfect labels.