There’s a Strawberry Test (if I can call it that) that’s become a bit of a running joke in generative AI. It involves asking an LLM something like: “How many times does the letter r show up in strawberry?”
Most LLMs will answer, confidently, that “r” shows up two times in strawberry. Which is obviously wrong, though not entirely surprising, given that LLMs don’t process words the way we humans do: they see tokens, not individual letters.
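For contrast, here’s how trivially a program gets this right, since it operates on actual characters rather than tokens. A minimal Python sketch (the helper name is mine, not from any LLM vendor’s API):

```python
# Count occurrences of a letter by iterating over the actual
# characters of the string -- the thing LLMs never get to see,
# because their input is chopped into multi-character tokens.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
print(count_letter("very", "r"))        # 1
```

Three lines of code, zero confident wrong answers.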
Enter Claude 3.5 Sonnet, the latest LLM from Anthropic. According to benchmarks, 3.5 Sonnet outperforms all other top LLMs across a range of tasks, essentially leading the pack as of now.
But can Claude 3.5 Sonnet count the “r” letters in strawberry? Let’s see:
Uh-oh, 404. Counting abilities successfully not found. Doubling Down On Error Mode initiated.
What if it’s my fault though? Maybe the question is just too simple. The new Claude is actually good, like really good, at more complex reasoning — or at least at giving the impression that it’s good. So maybe it can properly count letters if I give it a slightly more challenging strawberry-flavored sentence. Let’s find out:
Myup. The replies were, again, wrong. Way too wrong, since Claude also identified “very” as containing zero “r” letters. That was ve y astute of you, Claude, ve y astute.
What’s more interesting is how uber-confidently Claude defends its faulty answers. I replied to the wrong answers from above with this (half joking, half serious):
I was hoping that, after re-checking its answers, Claude would finally be able to spot the errors. It can do that sometimes. But it did not in this instance and, instead, it answered with:
Broken Record Mode initiated. And the whole “it may be you, human, who are wrong” + “rest assured, there’s no issue with my functionality” thing gave me strong HAL 9000 vibes.
To be clear, being unable to identify all “r” letters in “strawberry” doesn’t take much away from the fact that Claude 3.5 Sonnet is remarkably good overall. Plus, the new Artifacts feature from Anthropic is a great addition to Claude’s UI, and I can see it pushing LLM boundaries into uncharted territory.
***
For the record, GPT-4o and Google Gemini Advanced also failed the strawberry test when I checked. See my convos here (ChatGPT) and here (Gemini).
Anthropic, Google, and OpenAI could probably fix this particular strawberry problem through RLHF. But the issue at large — that LLMs are bad at counting letters and at counting in general — isn’t likely to go away soon.
Now, the question for the future is: Should a supposed superintelligence be able to solve trivial, seemingly useless tasks like counting the Rs in a word? I would very much think so. But maybe I’m wrong.
Maybe a superintelligence will be so focused on the super aspect of everything that our silly, mundane, human-centric reality would not matter to it at all. Maybe a superintelligence would be solving really hard problems during 99% of its uptime. And, every now and then, to show it didn’t forget where it came from, it would send us, via radio waves, the psychedelic words and voice of John Lennon from a bygone century:
“Let me take you down
Cause I’m going to Strawberry Fields
Nothing is real.”