You Shouldn’t Fine-tune LLMs for Domain-specific Tasks!?

Shakthi Warnakualsuriya
7 min read · May 20, 2024



Introduction

Large Language Models (LLMs) have become a powerful tool across various applications. Their vast knowledge base, acquired through pre-training on massive amounts of text data, allows them to excel in tasks like question answering and text generation. However, specific domains often require further tailoring. This is where fine-tuning comes in: training the LLM on additional data specific to the desired domain.

However, a recent study by Zorik Gekhman, Gal Yona, Roee Aharoni et al. (2024), titled “Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?”, raises concerns about the effectiveness of fine-tuning for LLMs. Their research suggests that introducing new knowledge through fine-tuning might not be as straightforward as we think. Let’s dive into what this research found and what the key takeaways are.

Cracking Open the Study

While the previous section highlighted the potential drawbacks of fine-tuning LLMs, let’s dive deeper into how the research team designed their experiment to investigate this phenomenon.

The core idea was to understand how introducing new knowledge through fine-tuning affects the model’s performance. To achieve this, they created a controlled setting. Imagine a fine-tuning dataset (D) and a pre-trained LLM (M). The researchers created various versions of dataset D, where some included more “unknown” information compared to M’s existing knowledge.

Here’s a breakdown of the approach:

  • Focusing on Factual Knowledge: The study concentrated on factual knowledge that could be structured in a specific format: (subject, relation, object). Think of it as a way to represent facts like “Paris” (subject) is “located in” (relation) “France” (object).
  • Converting to Questions and Answers: This factual knowledge was then transformed into a closed-book question-answering (QA) format. Imagine a question like “Where is Paris located?” with the answer “France”.
  • Building the Dataset: The researchers leveraged an existing resource called ENTITYQUESTIONS, which provides question-answer pairs based on factual information from Wikidata. This ensured a broad range of factual knowledge across various domains.
  • Fine-Tuning the Model: They used a specific pre-trained LLM (PaLM 2-M base) as the starting point (M) and fine-tuned it on the different variations of dataset D.
  • Measuring Success: To assess performance, they focused on whether the model’s answer perfectly matched the ground truth answer (exact match).

This setup allowed the researchers to isolate the impact of “unknown” information within the fine-tuning dataset and analyze its effect on the LLM’s performance.
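
To make the setup concrete, here is a minimal sketch (not the paper’s actual code) of how a factual triplet could be turned into a closed-book QA pair and how an answer could be scored by exact match. The relation templates and function names are illustrative assumptions, not something taken from the study.

```python
# Illustrative sketch (not the paper's actual code): turning a (subject, relation, object)
# triplet into a closed-book QA pair and scoring a model answer by exact match.

RELATION_TEMPLATES = {
    "located in": "Where is {subject} located?",                 # hypothetical template
    "capital of": "What country is {subject} the capital of?",   # hypothetical template
}

def triplet_to_qa(subject: str, relation: str, obj: str) -> dict:
    """Convert a factual triplet into a question-answer pair."""
    question = RELATION_TEMPLATES[relation].format(subject=subject)
    return {"question": question, "answer": obj}

def exact_match(prediction: str, ground_truth: str) -> bool:
    """The model's answer must match the gold answer exactly (after light normalization)."""
    return prediction.strip().lower() == ground_truth.strip().lower()

qa = triplet_to_qa("Paris", "located in", "France")
print(qa)                                     # {'question': 'Where is Paris located?', 'answer': 'France'}
print(exact_match(" france ", qa["answer"]))  # True
```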

Knowledge Categories and Their Influence

Remember how they focused on factual knowledge structured as (subject, relation, object) triplets? Here’s where it gets interesting. They classified these triplets into different categories based on how well the pre-trained LLM (M) aligned with the information. This classification plays a crucial role in understanding how the LLM interacts with new knowledge.

The researchers proposed a knowledge hierarchy named SliCK (Sampling-based Categorization of Knowledge).

Known: This encompasses information the LLM is likely familiar with based on its pre-training.

  • Highly Known: Facts the LLM confidently possesses.
  • Maybe Known: Information the LLM might know, but with less certainty.
  • Weakly Known: Facts the LLM might have some exposure to, but with very low confidence.

Unknown: This category represents entirely new information for the LLM, beyond its pre-trained knowledge base.

By introducing these categories, the team aimed to understand how the LLM’s performance differed when fine-tuned on data with varying degrees of familiarity.
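
As a rough illustration of how such a categorization could be approximated, one can sample the pre-trained model’s answers several times and bucket each fact by how often it is answered correctly. This is a simplified sketch, not the paper’s exact procedure, and `generate_answer` is a hypothetical helper that queries the pre-trained model M with a few-shot prompt:

```python
# Rough sketch of a SliCK-style categorization (simplified; not the paper's exact procedure).
# `generate_answer` is a hypothetical callable that queries the pre-trained model M with a
# few-shot prompt and returns a string answer.

def categorize_fact(question, gold_answer, generate_answer, n_attempts=10):
    # Greedy decoding (temperature 0), repeated with different few-shot exemplars.
    greedy_correct = sum(
        generate_answer(question, temperature=0.0, attempt=i) == gold_answer
        for i in range(n_attempts)
    )
    # Temperature sampling to probe weaker, less confident knowledge.
    sampled_correct = sum(
        generate_answer(question, temperature=0.5, attempt=i) == gold_answer
        for i in range(n_attempts)
    )

    if greedy_correct == n_attempts:
        return "HighlyKnown"   # always correct under greedy decoding
    if greedy_correct > 0:
        return "MaybeKnown"    # sometimes correct under greedy decoding
    if sampled_correct > 0:
        return "WeaklyKnown"   # only correct when sampling with temperature
    return "Unknown"           # never produces the correct answer
```

The intuition is that greedy decoding reflects what the model confidently possesses, while temperature sampling can surface facts it only weakly knows.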

The Findings

The researchers created fine-tuning datasets (D) with varying proportions of “Unknown” examples (entirely new information for the LLM). They measured performance based on how well the LLM answered questions in a separate test set (closed-book format, meaning the LLM couldn’t access the training data while answering).

Here’s what they found:

  • Higher Unknown Ratio, Lower Performance: As the proportion of “Unknown” examples increased, the LLM’s performance generally declined. This suggests that “Unknown” examples were less helpful for the LLM compared to familiar information (“Known”).
  • Early Stopping is Key: Performance also depended on how long the LLM was fine-tuned. Stopping early (EARLY_STOP) typically yielded the best results. Longer training (CONVERGENCE) often led to a performance drop, likely due to overfitting on the training data. Interestingly, this overfitting effect became worse with a higher percentage of “Unknown” examples.

Figure from the paper (Gekhman et al., 2024): train and development accuracies as a function of fine-tuning duration, when fine-tuning on 50% Known and 50% Unknown examples. Unknown examples are fitted substantially slower than Known ones; the best development performance is reached when the LLM has fit the majority of the Known training examples but only a few of the Unknown ones, after which fitting Unknown examples reduces performance.

Are Unknown Examples Harmful or Neutral?

The initial results suggested a performance drop with more “Unknown” examples. But this could simply be because there were fewer “Known” examples (helpful ones) in those datasets. To isolate the impact of “Unknown” examples, the researchers removed them entirely from some datasets, creating “DKnown” versions.

The findings were intriguing:

  • Early Stopping Neutralizes Unknown Examples: When stopping early (EARLY_STOP), the performance of the original dataset (D) and the “DKnown” version (without unknowns) were almost identical. This suggests that “Unknown” examples had a neutral effect in this scenario (since removing them barely impacted performance).
  • Unknown Examples are Harmful with Longer Training: With longer training (CONVERGENCE), “Unknown” examples actually hurt performance. The original dataset (D) underperformed compared to “DKnown” (without unknowns), and this gap grew with a higher percentage of unknowns. This suggests that “Unknown” examples can be detrimental when the LLM overfits them during extended training.
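
To picture the “DKnown” construction described above, here is a minimal sketch that simply drops every example the pre-trained model categorized as Unknown. The dataset layout and field names are assumptions for illustration, not the paper’s code:

```python
# Illustrative sketch: building a "DKnown" variant of the fine-tuning set by dropping
# every example the pre-trained model categorized as Unknown (field names are assumptions).

def build_d_known(dataset):
    known_categories = {"HighlyKnown", "MaybeKnown", "WeaklyKnown"}
    return [ex for ex in dataset if ex["category"] in known_categories]

dataset = [
    {"question": "Where is Paris located?", "answer": "France", "category": "HighlyKnown"},
    {"question": "Where is Exampleville located?", "answer": "Nowhere", "category": "Unknown"},
]
print(len(build_d_known(dataset)))  # 1, since the Unknown example is filtered out
```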

Why Unknown Examples Are Tricky

The researchers dug deeper to understand how the LLM interacted with different knowledge categories. They analyzed how well the LLM learned (“fitted”) examples during fine-tuning.

The LLM learned “Unknown” examples significantly slower compared to “Known” ones. At the early stopping point (EARLY_STOP), the LLM had fit most of the “Known” examples but only a small portion of the “Unknown” ones. This explains why “Unknown” examples had a neutral effect in this scenario — the LLM hadn’t learned them well enough to be impacted by them.
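
This is also why dev-set-based early stopping works as a mitigation: training is halted around the point where most Known examples are fitted but most Unknown ones are not yet. Below is a minimal, framework-agnostic sketch; the training and evaluation callables are placeholders you would supply, not anything from the paper:

```python
# Minimal, framework-agnostic sketch of dev-based early stopping (placeholders, not the
# paper's training code). `train_one_epoch(model) -> model` and `eval_dev(model) -> float`
# (e.g. exact-match accuracy on a development split) are supplied by the caller.

def fine_tune_with_early_stopping(model, train_one_epoch, eval_dev, max_epochs=20, patience=2):
    best_score, best_model, stale_epochs = float("-inf"), model, 0
    for _ in range(max_epochs):
        model = train_one_epoch(model)
        score = eval_dev(model)
        if score > best_score:
            best_score, best_model, stale_epochs = score, model, 0   # new best checkpoint
        else:
            stale_epochs += 1
            if stale_epochs >= patience:   # dev accuracy stopped improving, so stop early
                break
    return best_model, best_score
```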

Figure from the paper (Gekhman et al., 2024): results of the linear model for predicting test accuracy.

Generalization to Completely New Information

So far, the LLM was tested on questions related to the types of information it saw during fine-tuning. The researchers explored if these findings held true for entirely new information (out-of-distribution data). They tested the LLM on questions with relations (types of facts) it hadn’t encountered before.

The results were consistent: a higher proportion of “Unknown” examples led to lower performance on these completely new questions, and the LLM again learned “Unknown” examples slowly, with fitting them during longer training proving detrimental to performance on new information as well.

Conclusion: Rethinking Fine-Tuning for LLMs

This research sheds light on a potential drawback of using supervised fine-tuning to introduce new knowledge to Large Language Models (LLMs). The findings suggest a correlation between fine-tuning on entirely new information (“Unknown” examples) and an increased tendency for the LLM to generate incorrect responses (“hallucinations”).

Here’s a summary of the key takeaways:

  • LLMs Struggle with New Knowledge: The study reveals that LLMs have difficulty integrating entirely new facts during fine-tuning. They primarily leverage their pre-existing knowledge base, with fine-tuning acting more as a tool to refine its application.
  • Unknown Examples and Overfitting: Introducing a high proportion of “Unknown” examples during fine-tuning can lead to overfitting, where the LLM prioritizes these new examples and performs poorly on questions related to its existing knowledge.
  • Early Stopping as a Mitigation Strategy: The research suggests that stopping the fine-tuning process early (EARLY_STOP) can help mitigate the negative impact of “Unknown” examples. This is because the LLM hasn’t had enough time to overfit on them.
  • Filtering Unknown Examples: An alternative approach involves filtering out “Unknown” examples from the fine-tuning dataset altogether. The study’s results suggest this can reduce the risk of overfitting without compromising performance.
  • Superficial Alignment Hypothesis Challenged: The results challenge the notion that fine-tuning simply aligns the LLM’s output format for user interaction. The researchers observed that the selection of fine-tuning examples significantly influences the model’s ability to utilize its pre-existing knowledge.

Limitations of the Study and Future Directions

While these findings offer valuable insights, there are limitations to consider. The study focused on a single LLM, and further research is needed to explore if these results hold true for other models. Additionally, the research primarily focused on closed-book question-answering tasks. Future work is necessary to validate these findings in more complex scenarios like long-form text generation.

Overall, this research encourages a critical reevaluation of fine-tuning practices for LLMs. By understanding the challenges associated with introducing new knowledge, we can develop more effective strategies for fine-tuning LLMs and harnessing their full potential.

