One can manipulate LLM rankings to put any model in the lead—only by modifying the single character separating demonstration examples.
- MMLU performance varies by +/- 23% depending on the choice of delimiter across leading open model families (Llama, Qwen, and Gemma).
- Closed models, GPT-4o, are also brittle to the choice of delimiter.
- MMLU performance varies by +/- 23% depending on the choice of delimiter across leading open model families (Llama, Qwen, and Gemma). - Closed models, GPT-4o, are also brittle to the choice of delimiter.