> You do a paper showing that problem X can be solved slightly better by downloading and training on a billion tweets.
That's true. Sometimes you might try to tweak the algorithm itself rather than the data, though, or experiment with different kinds of preprocessing, and in those cases it would be helpful to run different experiments on a shared dataset.
My limited experience is from around the time deep learning was only just becoming a big thing, so things may have changed since. Maybe nowadays you just throw more tweets and GPUs at the problem.
You do a paper showing that problem X can be solved slightly better by downloading and training on a billion tweets.
But you don’t have the copyright to those tweets, so you can’t share data.
> don't people do cross-validation or something
A lot of standard problems come with a dataset that is already split into train and test sets.
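A rough sketch of the two evaluation styles being contrasted, using only the standard library (the data here is a hypothetical stand-in, not any real benchmark):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n))
        yield train, test
        start += size

data = list(range(10))  # stand-in for real examples

# Style 1: the dataset ships with a fixed split, so everyone
# evaluates against the same held-out test set.
fixed_train, fixed_test = data[:8], data[8:]

# Style 2: cross-validation, where each example serves as test
# data in exactly one fold.
for train_idx, test_idx in kfold_indices(len(data), 5):
    pass  # train on data[i] for i in train_idx, evaluate on test_idx
```

With a fixed split, results are directly comparable across papers; with cross-validation, you get a variance estimate but comparability depends on everyone using the same folds.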