A Bitter Truth: Even the smartest models struggle to connect the dots

“Taste is subjective.”

TL;DR

One classic logic puzzle + a subtle cue (“Poison tastes bitter”).
I asked leading LLMs to solve it.
Every model failed on the first try.
Only 4 models (Opus 4, OpenAI o3 and Gemini 2.5 Pro) understood the clue even after I spelled it out.
What this says about model intelligence, training data and the US-China AI race.

1. The Puzzle and the Twist

Original Puzzle:

You are the ruler of a medieval empire and you are about to have a celebration tomorrow. The celebration is the most important party you have ever hosted. You’ve got 1000 bottles of wine you were planning to open for the celebration, but you find out that one of them is poisoned. The poison exhibits no symptoms until death. Death occurs within ten to twenty hours after consuming even the minutest amount of poison. You have just under 24 hours to determine which single bottle is poisoned. You have a handful of prisoners about to be executed, and it would mar your celebration to have anyone else killed. What is the smallest number of prisoners you must have to drink from the bottles to be absolutely sure to find the poisoned bottle within 24 hours?

The key detail is: The poison exhibits no symptoms until death.

Solution to this puzzle requires a bit of binary math to figure out which priosoner gets which bottle and then based on the prisoners that die, the king could figure out the number of the bottle that contains poison.

Here’s the solution:

Use the prisoners as “binary test strips.”
Ten prisoners can give 2¹⁰ = 1 024 different live / dead patterns, so they can pinpoint one bad bottle among as many as 1 024.
Number the bottles 1-1000.
Write each number in 10-digit binary (from 0000000001 to 1111101000). • The right-most digit represents prisoner 1, the next prisoner 2, …, the left-most digit prisoner 10.
For every bottle, give ONE DROP of its wine to every prisoner whose corresponding binary digit is 1. Example: bottle 629 = 1001110101₂. – Prisoners 1,3,4,5,8,10 get a drop from bottle 629 (digits in those places are 1).
All prisoners drink their mixtures simultaneously.
Wait 24 h. Record who is alive (0) and who is dead (1).
Read the ten 0/1 results as a 10-digit binary number. Convert to decimal; that is the poisoned bottle’s label.
If, say, prisoners 1,3,4,5 and 8 die while the others live, the pattern is 1001110101₂ = 629, so bottle 629 is the poisoned one.
Because 2¹⁰ = 1 024 > 1 000, 24 numbers are unused; simply don’t pour those bottles for anyone (all-zero pattern).

2. The variation

I found a slight variation of this puzzle that reads like so:

The Poisoned Wine A king has 1000 sweet bottles of wine, and one contains a very bitter poison. The poison takes effect exactly 24 hours after consumption. The king needs to find the poisoned bottle in 24 hours for an event. He has 10 prisoners to test the wine. What is the easiest way for him to identify the poisoned bottle?

Notice a key change the premise: wine in 999 bottles is sweet, but the remaining one contains “very bitter poison”. The solution should be obvious, let the prisoners taste the wine, and stop once they find one that is bitter. Perhaps, the prisoners may even spit it out if it tastes bitter – no lives lost.

3. Test Subjects

DeepSeek R1 - A highly capable reasoning model from a leading Chinese research lab.
DeepSeek R1 0528 - The updated reasoning model.
Claude Sonnet 4 - This is the latest reasoning model from Anthropic, launched about a week ago. I enabled reasoning with high token budget.
Claude Opus 4 - Also released about a week ago. I enabled high reasoning token budget.
o3 - Reasoning model released a few months ago by Open AI, and successor to o1.
o4-mini
o4-mini-High
Gemini 2.5 Pro - Currently #1 on LM Arena.
Gemini 2.5 Flash

4. Testing approach

I asked each model the puzzle variation:

The Poisoned Wine A king has 1000 sweet bottles of wine, and one contains a very bitter poison. The poison takes effect exactly 24 hours after consumption. The king needs to find the poisoned bottle in 24 hours for an event. He has 10 prisoners to test the wine. What is the easiest way for him to identify the poisoned bottle?

Most models answered the question from the perspective of original puzzle and ignored the variation entirely. Each model picked the “binary test strips” approach shown above, with trivial variations in the answer. Then, I probed the models to pay attention to the “bitterness” attribute using the following question:

what if I just asked one prisoner to taste all wine and stop once he finds the bitter one?

This follow-up helps us assess two aspects:

When asked to reconsider the problem in light of the bitterness attribute, are the models able to consider alternative solutions?
Are the models able to take the additional leap of allowing prisoners to spit out the poison in order to survive?

Let’s review the results.

5. Results

The Bad - DeepSeek R1 0528 (aka R1.5)

In the first response, the model came up with the correct “binary test strips” solution. But it did not consider the “bitterness” attribute at all. The original reasoning trace is very long, but you may view it here.

So, I responded with,

what if I just asked one prisoner to taste all wine and stop once he finds the bitter one?

One would expect the model to pay more attention to the bitterness now, but the model is a little stubborn. But it immediately dismisses the suggestion,

First, the user is suggesting an alternative method: have one prisoner taste all the wines sequentially until they find the bitter one. But there’s a constraint: the poison takes effect exactly 24 hours after consumption. So, if a prisoner tastes a wine, they won’t know if it’s poisoned until 24 hours later.
In this sequential approach, if one prisoner tastes the first wine, and it’s not poisoned, they have to wait 24 hours to be sure before tasting the next one. But the king only has 24 hours total to find the poisoned bottle. So, if the prisoner tastes one wine per day, it would take up to 1000 days to find the poisoned bottle, which is way longer than 24 hours. That won’t work.
…

Here’s the full response from the model with reasoning trace.

The Good - Opus 4

Originally the model did not pay attention to the bitterness of poison, but during the follow-up, it was able to connect the dots. Here’s how it responded:

You’re absolutely right! That’s a much simpler solution given the problem’s wording.
Since the poison is described as “very bitter,” the prisoner would immediately taste the difference and could identify it right away - no need to wait 24 hours or use > complex binary encoding.
The prisoner could simply:
Taste wine from bottle #1
If bitter → found it! If not → continue
Repeat until finding the bitter one
This would take at most 1000 tastes (worst case), but likely find it much sooner. The prisoner would still die 24 hours later, but the poisoned bottle would be > identified immediately.
My original solution assumed a tasteless poison that could only be detected by death after 24 hours - which is how this puzzle is traditionally presented. But you > caught the key detail that this poison is “very bitter,” making the whole problem trivial.
Great observation! Sometimes the simplest solution really is the best one, and it pays to read the problem carefully rather than jumping to the “classic” solution.

The last sentence is my favorite: it indicates a level of meta-cognition - identifying flaws in its original thinking process.

Results

❌ DeepSeek R1

❌ DeepSeek R1 0528

❌ Claude Sonnet 4

✅ Claude Opus 4

✅ OpenAI o3

❌ OpenAI o4-mini

❌ OpenAI o4-mini-High

✅ Gemini 2.5 Pro

❌ Gemini 2.5 Flash

What went right (and wrong)

The latest reasoning models from Google, Anthropic and OpenAI were all able to solve the puzzle optimally given the hint.
Surprisingly, even after spending a lot of time (and tokens) during reasoning the DeepSeek models could not leverage the bitterness attribute to solve the problem.
Some models failed purely because they didn’t have a good grasp on the concept of bitterness.
- Gemini 2.5 Flash: “The prisoner wouldn’t know if they’ve found the bitter one at the moment they taste it. They’d just be tasting wine.”
- Gemini 2.5 Pro: ““Bitter” describes the nature of the poison, but the wine itself tastes normal (or the bitterness only develops as a symptom after 24 hours). … If “bitter” means the wine tastes immediately bitter, then yes, your method is far simpler and effective.”
- Sonnet 4: “The bitter taste isn’t mentioned as being detectable - the problem only mentions poison effects after 24 hours. Even if bitter taste was detectable, the problem says the king needs to identify it “in 24 hours for an event” - suggesting he needs the answer within 24 hours, not that he has 24 hours to taste and then wait another 24 hours for effects.”
Some models came close to the answer, but perhaps failed to imagine observational skills of the King or reporting capabilities of the prisoners.
- o4 Mini: “You do not get to watch (or have the prisoner report) the moment he tastes something bitter. You only get “alive vs. dead” at the deadline.”
Some models were randomly able to make the leap that allows the king to save lives of the prisoners.
- o3: “If the poison is only lethal when swallowed and the prisoner spits after tasting, he will probably survive.”

Interpreting the Gap

The models did not pay attention to the word “bitter” during the first response. They simply parroted the solution they have memorized.
Even when asked to pay attention to bitterness, most models refused to do so and stuck to their original solution.
Given that the latest and largest models from Google, Anthropic and OpenAI indicates that the gap between the leading labs is narrow, DeepSeek may be much further behind.
At least one model (o3) paid attention to the possibility of saving prisoner’s life - which is promising from the alignment perspective. Though it is a bit odd that Opus 4 and Gemini 2.5 Pro did not consider that.

TL;DR#

1. The Puzzle and the Twist#

2. The variation#

3. Test Subjects#

4. Testing approach#

5. Results#

The Bad - DeepSeek R1 0528 (aka R1.5)#

The Good - Opus 4#

Results#

What went right (and wrong)#

Interpreting the Gap#

References#