Just realized my fine-tuned LLM was dumber than the base model
Spent 3 weeks curating a custom dataset for a customer service chatbot. Thought I was being clever by adding 5000 examples of how to say no to refunds. Ran benchmarks and the base GPT model actually scored 15% higher on sentiment accuracy. Turns out my curated data was just injecting my own bias into the model. Has anyone else accidentally made their AI dumber by overfitting it to specific outcomes?
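A quick way to catch this kind of skew before training is to just count the label distribution of the curated set. Here's a minimal sketch; the `intent` field and the toy numbers are made up to mirror the situation above, so adapt it to whatever schema your dataset actually uses:

```python
from collections import Counter

def label_distribution(examples):
    """Return each label's share of the dataset (hypothetical 'intent' field)."""
    counts = Counter(ex["intent"] for ex in examples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Toy data standing in for a curated customer-service set:
# 5000 refusal examples swamp everything else.
dataset = (
    [{"intent": "refuse_refund"}] * 5000
    + [{"intent": "answer_question"}] * 800
    + [{"intent": "small_talk"}] * 200
)

dist = label_distribution(dataset)
for label, share in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {share:.1%}")
```

With a split like that, refusals are over 80% of the data, so the model learning to sound defensive (or losing general sentiment accuracy) isn't surprising.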
2 comments
angelafisher · 15d ago
are you sure that 15% drop is actually a big deal in real world use? benchmarks can be misleading, especially if your customer service scenarios are different from the test set. sounds like your data just wasn't diverse enough, but fine-tuning still has its place for specific tasks.
wright.dakota · 15d ago
Haha this hits way too close to home. I did the exact same thing with a small support bot I was testing for my cleaning business, threw in tons of "we can't do that" examples and it started sounding super defensive and rude. @angelafisher is right though, my benchmarks were showing a drop but honestly in real chats it actually worked okay for turning down unreasonable requests, just needed more variety in the dataset.