A machine which confabulates and cannot count is not a good fit for accounting tasks. It will make all sorts of subtle errors that are difficult for humans to notice.
That wouldn't necessarily be true even if models really "couldn't count", because software exists: if an LLM builds an Excel spreadsheet instead of doing all the arithmetic itself, it's much harder for it to mess up and much easier to notice and recover when it does. And it's even less true given what this paper actually tests, which amounts to "LLMs don't have literally perfect accuracy when you make them do increasingly large problems with zero thinking".
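To make the spreadsheet point concrete, here's a minimal sketch (assuming Python and openpyxl; the file name, labels, and cell layout are all made up): the model emits a formula instead of a hard-coded total, so the arithmetic is done by the spreadsheet engine and a wrong number sits right next to its inputs where a reviewer can see it.

```python
from openpyxl import Workbook

# Hypothetical line items the model would otherwise "add up in its head".
line_items = [("Office supplies", 412.50), ("Travel", 1280.00), ("Software", 99.99)]

wb = Workbook()
ws = wb.active

# Write the raw inputs into columns A and B.
for row, (label, amount) in enumerate(line_items, start=1):
    ws.cell(row=row, column=1, value=label)
    ws.cell(row=row, column=2, value=amount)

# Let the spreadsheet do the arithmetic via a formula, not a hard-coded number.
total_row = len(line_items) + 1
ws.cell(row=total_row, column=1, value="Total")
ws.cell(row=total_row, column=2, value=f"=SUM(B1:B{len(line_items)})")

wb.save("expenses.xlsx")
```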
(Confabulation is IMO a much bigger problem, but it's unrelated to architecture - it's an artifact of how models are currently trained.)
They really can’t count; that’s not how they work at all. They don’t reason about maths, they predict the most likely output for a given context. That’s sometimes useful, but it’s not at all the same thing.
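For what it’s worth, this is roughly what "predict the most likely output for a given context" looks like in code. It's only an illustrative sketch, assuming the Hugging Face transformers library and GPT-2 as a stand-in model; the prompt is arbitrary. Whatever comes out is just the highest-scoring token sequence, not the result of running an addition procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small off-the-shelf model, just to show the mechanism.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "127 + 456 ="
ids = tok(prompt, return_tensors="pt").input_ids

# Greedy decoding: at every step, append the single most likely next token.
with torch.no_grad():
    for _ in range(5):
        logits = model(input_ids=ids).logits   # a score for every vocabulary token
        next_id = logits[0, -1].argmax()       # "most likely output for a given context"
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
```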
Why do you think the results of this paper contradict these claims at all?