> Now ask the best LLM trained on "all the data" to translate some fragment of some isolate language not in its training set and not very related to existing languages.
If you give them the dictionary and grammar book as in-context instructions, it can do pretty well.
“Gemini v1.5 learns to translate from English to Kalamang purely in context, following a full linguistic manual at inference time. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea. Gemini has never seen this language during training and is only provided with 500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences in context. It basically acquires a sophisticated new skill in the neural activations, instead of gradient finetuning.”
Synthetic data might be the answer if you're fine with any data, but I haven't come across many synthetic datasets that are of high quality, and if you want high-quality output from an LLM, I'm not sure Tiny Stories et al. can provide that.
> Once, there was a girl who wanted to write a story. She thought and thought about what she could write about. She felt it was too boring to just write about trees and flowers. Suddenly, an idea came to her. She decided to write about her waist. She started to write about how her waist was round, and how it jiggled when she danced. Her story was so fun and exciting! She wrote about how she liked to put a belt around her waist and how it made her feel smarter. She even wrote a rhyme about her waist: "My waist is round and jiggly, And when I dance, it's so wiggly." The girl was so proud of the story she wrote. She was no longer bored - writing about her waist was much more fun!
Hardly a high-quality "story", and an LLM trained on data like that won't produce high-quality output no matter how long you train it.
Edit: Another example from Tiny Stories, just because of how fun they end up being:
> One day, a little boy named Jack was playing in his room. He decided to go and sit on his favourite chest. When he sat down, he noticed something unusual. The chest smelled smelly! Jack had never noticed a smelly smell before and he couldn't work out what it was. Jack's Mum heard him say 'That chest smells smelly', so she came into his room to see what was happening. When she saw the chest, she knew what was wrong. Jack's little puppy had been using the chest as a bed! His Mum scooped the naughty puppy up in her arms and took him outside. When the puppy was outside, the smelly smell went away. Jack was so relieved! He sat back down on the chest, and said 'Ahhh, much better!'
Do people really expect to be able to train on this and get high quality output? "Garbage in, garbage out", or however that goes...
>This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention).
>In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.
The point of TinyStories isn't to serve as an example of a sophisticated model, but rather to show that the emergent ability of producing coherent language can happen at smaller scales, and from a synthetic data set, no less. TinyStories is essentially the language model equivalent of a young child, and it's producing coherent language -- it's not producing grammatically correct nonsense like the famous "colorless green ideas sleep furiously" phrase from Chomsky.
>but I haven't come across many synthetic datasets that are of high quality
I'm not really sure what your personal experience has to do with the viability of synthetic data; it's already been proven to be a useful resource. For example, Meta directly stated this upon the release of their Llama 3 model:
>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3. We also leveraged synthetic data to train in areas such as coding, reasoning, and long context. For example, we used synthetic data to create longer documents to train on.
It's grammatically correct. That's the point: the model produces correct grammar even when the content is semantic nonsense, and it's still an open question how small such a model can get. GPT-2's grammar was atrocious.
I asked my coworkers in the office, but none of them was able to answer that. Not sure if it's because they're ESL (me included) or because they were GPTs in disguise.
I couldn't get it either, but I was always bad at complete-the-sequence-style questions on IQ tests. I see so many possibilities and get overwhelmed trying them all out. Sometimes I find perfectly valid continuations that aren't the 'right' answer!
Hmm can't say I entirely disagree with them on that one. I mean it's clearly not a harmful phrase but it definitely is a useless one.
It carries almost zero information. Who is going to read "trigger warning" and think "ooh, they know that I'm highly sensitive about this specific unknown subject. I don't want to get triggered, I'll stop reading!"?
Contrast it with something like "spoilers" where everyone agrees on what it means and people generally really don't want to read spoilers.
Synthetic data is the answer. For example, see the TinyStories dataset (https://arxiv.org/abs/2305.07759).