> Chatbot Software Begins to Face Fundamental Limitations

> Recent results show that large language models struggle with compositional tasks, suggesting a hard limit to their abilities.

Your first question with anything like this should always be WHICH MODELS:

> For our experiments, we evaluate the performance of 6 LLMs: GPT4 (gpt-4) [58], ChatGPT (GPT3.5-turbo) [57], GPT3 (text-davinci-003) [11], FlanT5 [17] and LLaMa [75].

This is ancient. This research was done centuries ago. This is research about the possibility of isotopes, written about radium in 1903, published in 1946. It is a criminal level of journalistic malpractice to leave uninformed readers with the impression that this is where AI stands today.