You would think that the training sets for these models already included enough Mensa-style IQ tests that the model knows how to do this kind of test. It takes humans 2, or at most 3, examples to "get" what the test is asking for, and then they can start filling in answers to the actual questions. Meanwhile it takes at least hundreds of examples (in the public set) to train o3 to do this test.
The need for a huge training set to solve simple questions never stops bewildering me. I think that to get a human-like intelligent model, we need to figure out why humans learn from 2 examples and the models don't. That's not to say the current models aren't intelligent in their own way, or aren't useful already.
Human intelligence is bootstrapped by biological evolution and by society, neither of which is fast or efficient. The truly individual part of intelligence is tiny; it's vastly overrated and relies on those two. Similarly, LLMs perform in-context learning, which is much more efficient because it relies on pre-baked knowledge. Yes, the generalization ability still doesn't compare to a human's, but it's entirely possible that much better generalization is achievable by slowly bootstrapping it the same way.
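To make the in-context learning point concrete, here's a minimal sketch of what "learning from 2 examples" looks like mechanically for an LLM: the pre-baked knowledge sits in frozen weights, and the only per-task signal is a couple of demonstrations placed in the prompt. (This is a toy: no real model is called, and `build_few_shot_prompt` is just a hypothetical helper showing the standard few-shot layout.)

```python
def build_few_shot_prompt(examples, query):
    """Format input->output pairs plus a query the way a
    few-shot prompt to an LLM is typically laid out.
    No gradient updates happen; the 'learning' is done at
    inference time from the examples in the context window."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Two demonstrations are enough for a human -- or a pretrained
# model via in-context learning -- to infer "reverse the string":
examples = [("abc", "cba"), ("hello", "olleh")]
prompt = build_few_shot_prompt(examples, "world")
print(prompt)
```

The contrast with the parent comment's point: fine-tuning o3 on hundreds of examples updates weights, while this style of prompting leans entirely on what pretraining already baked in.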
Sounds fishy to me