I'm recommending the following article largely because I've never before seen one on bad charts in The Washington Post. Almost all of the charts shown are bad in ways I've also never seen before, so I won't have anything to say about most of them. What's interesting is not so much the charts themselves as the fact that such atrocious charts were presented at all, especially by artificial intelligence companies. How did it happen? Don't those companies have any natural intelligences working for them?
The mockery about "chart crimes"…nearly overshadowed the technology upgrades announced by two artificial intelligence start-ups. During a demonstration Thursday of ChatGPT's newest version, GPT-5, the company showed a visual in which it appeared 52.8 percent was a larger number than 69.1 percent, which, in turn, was somehow equal to 30.8 percent.
Ironically, at this point the article is interrupted by a Post promotion that reads: "Get concise answers to your questions. Try Ask The Post AI."
… Several more times in the demonstration, ChatGPT parent company OpenAI showed confusing or dubious graphics, including others in which a smaller number appeared visually larger than an actually bigger number…. Conspiracy theories started that AI generated the botched data visuals. (An OpenAI employee apologized for the "unintentional chart crime," and CEO Sam Altman said on Reddit that staff messed up charts in rushing to get their work done. Asked for further comment, OpenAI referred to Altman's Reddit remarks.)
Like the so-called lab-leak theory, these aren't "conspiracy theories" but reasonable hypotheses to explain what is otherwise hard to understand: how were such "horrible" charts not only made but shown to the public? Altman's claim isn't plausible, since the kinds of errors made in OpenAI's charts are not the kinds made by human beings. Certain types of charting errors are common among inexperienced people, and others are common among those with experience and an intent to deceive, but these were of neither type. Thus, it seems plausible that the charts were created using AI, though that doesn't explain why no human being sanity-checked them. Perhaps the people who work there have too much faith in their own product.
… Also last week, the start-up Anthropic showed two bars comparing the accuracy rates of current and previous generations of its AI chatbot, Claude. …
The y-axis of the bar chart in question[1] does not start at zero percent, a common type of graphical distortion often used to exaggerate a difference[2]. Moreover, there's no indication in the chart itself that it has been truncated, so that you have to look at the y-axis scale to discover it. In the rare case when it's permissible to truncate a chart, it's obligatory to include a break in the scale to alert the reader to the truncation[3], though this particular chart is not a rare case.
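To see the effect for yourself, here's a minimal sketch in Python (using matplotlib, with made-up accuracy numbers of my own rather than Anthropic's) showing how starting the y-axis well above zero makes a two-point difference look enormous:

```python
import matplotlib.pyplot as plt

labels = ["Old model", "New model"]
accuracy = [72.5, 74.5]  # made-up numbers, two percentage points apart

fig, (honest, truncated) = plt.subplots(1, 2, figsize=(8, 3))

honest.bar(labels, accuracy)
honest.set_ylim(0, 100)           # axis starts at zero: the bars look nearly equal
honest.set_title("Axis from 0%")

truncated.bar(labels, accuracy)
truncated.set_ylim(72, 75)        # truncated axis: the same gap looks like a landslide
truncated.set_title("Axis from 72%")

plt.tight_layout()
plt.show()
```

The two panels plot exactly the same numbers; only the scale differs.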
Anthropic has a motive to exaggerate the two-percentage-point gain in accuracy between Claude Opus 4 and Opus 4.1. This is an all-too-human "error", as opposed to the bizarre ones in OpenAI's charts. If I find out that Anthropic's bar chart was generated by AI, I'll be more impressed by Claude's ability to imitate humanity than by GPT-5's.
Jessica Dai, a PhD student at the University of California at Berkeley's AI research lab, said her big beef with the Anthropic chart was the "hypocrisy," not the off-base scale. The company has previously prodded researchers evaluating AI effectiveness to include what are called confidence intervals, or a range of expected values if a data study is repeated many times.
This is good advice.
Dai wasn't sure that's the right approach but also said that Anthropic didn't even follow its own recommendation. If Anthropic had, Dai said, it might have wiped out statistical evidence of an accuracy difference between old and new versions of Claude. …
Another all-too-human reason for the omission.
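For the curious, here's a rough sketch of the kind of check Dai describes, using a normal-approximation confidence interval for the difference between two accuracy rates. The sample size and the two accuracies are placeholders I've made up for illustration; they are not Anthropic's actual test figures:

```python
from math import sqrt

n = 1000                        # hypothetical number of test problems per model
p_old, p_new = 0.725, 0.745     # hypothetical accuracies, two points apart

diff = p_new - p_old
se = sqrt(p_old * (1 - p_old) / n + p_new * (1 - p_new) / n)
margin = 1.96 * se              # half-width of an approximate 95% interval

print(f"Difference: {diff:.3f}")
print(f"95% CI: ({diff - margin:.3f}, {diff + margin:.3f})")
```

With a thousand made-up test questions per model, the interval runs from about -0.02 to +0.06 and so straddles zero, which is exactly the sort of result Dai suggests such a chart might be hiding.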
[T]o some data experts and AI specialists, the chart crimes are a symptom of an AI industry that regularly wields fuzzy numbers to stoke hype and score bragging points against rivals. …Big technology companies and start-ups love charts that appear to show impressive growth in sales or other business goals but that have no disclosed scale that reveal the numbers behind those graphics. … To the companies, these charts offer a glimpse of their success without overexposing their finances. …
This explanation works for Anthropic's chart but not for those put out by OpenAI. Moreover, it's true of every industry.
By the way, I agree whole-heartedly with the following comment by charting guru Alberto Cairo:
He wasn't irked only about the basic arithmetic abuses. Cairo also was dubious about OpenAI's and Anthropic's use of graphs for two or three numbers that people could understand without any charts. "Sometimes a chart doesn't really add anything," he said. …Cairo pointed to research that may help explain why companies gravitate to charts: They ooze authority and objectivity, and people may be more likely to trust the information.
Pointing to some uncited "research" as support also oozes "authority and objectivity, and people may be more likely to trust the information". Luckily, in this case, common sense and experience support Cairo's claim.
To [Dai] and some other AI specialists with whom I spoke, misguided charts may point to a tendency in the industry to use confidently expressed but unverified data to boast about the technology or bash competitors. The Post previously found that AI detection companies claiming to be up to 99 percent accurate had largely untested capabilities. Meta was mocked this spring for apparently gaming its AI to boost the company's standings in a technology scoreboard. … "Just because you put a number on it, that's supposed to be more rigorous and more real," Dai said. "It's all over this industry."
It's all over every industry.
In recent months, the AI industry has started moving toward so-called simulated reasoning models that use a "chain of thought" [CoT] process to work through tricky problems in multiple logical steps. At the same time, recent research has cast doubt on whether those models have even a basic understanding of general logical concepts or an accurate grasp of their own "thought process." Similar research shows that these "reasoning" models can often produce incoherent, logically unsound answers when questions include irrelevant clauses or deviate even slightly from common templates found in their training data.
My experience with testing the ability of the allegedly artificially intelligent chatbots to solve simple logic puzzles is similar[4].
In a recent pre-print paper, researchers from the University of Arizona summarize this existing work as "suggest[ing] that LLMs [Large Language Models] are not principled reasoners but rather sophisticated simulators of reasoning-like text." To pull on that thread, the researchers created a carefully controlled LLM environment in an attempt to measure just how well chain-of-thought reasoning works when presented with "out of domain" logical problems that don't match the specific logical patterns found in their training data.
In case you don't know, "pre-print" means that this paper has not been peer-reviewed or published yet, so take it with a grain of salt.
The results suggest that the seemingly large performance leaps made by chain-of-thought models are "largely a brittle mirage" that "become[s] fragile and prone to failure even under moderate distribution shifts," the researchers write. "Rather than demonstrating a true understanding of text, CoT reasoning under task transformations appears to reflect a replication of patterns learned during training." …As the researchers hypothesized, these basic models started to fail catastrophically when asked to generalize novel sets of transformations that were not directly demonstrated in the training data. While the models would often try to generalize new logical rules based on similar patterns in the training data, this would quite often lead to the model laying out "correct reasoning paths, yet incorrect answer[s]." In other cases, the LLM would sometimes stumble onto correct answers paired with "unfaithful reasoning paths" that didn't follow logically.
"Rather than demonstrating a true understanding of text, CoT reasoning under task transformations appears to reflect a replication of patterns learned during training," the researchers write. …
Rather than showing the capability for generalized logical inference, these chain-of-thought models are "a sophisticated form of structured pattern matching" that "degrades significantly" when pushed even slightly outside of its training distribution, the researchers write. Further, the ability of these models to generate "fluent nonsense" creates "a false aura of dependability" that does not stand up to a careful audit.
As such, the researchers warn heavily against "equating [chain-of-thought]-style output with human thinking" especially in "high-stakes domains like medicine, finance, or legal analysis." Current tests and benchmarks should prioritize tasks that fall outside of any training set to probe for these kinds of errors, while future models will need to move beyond "surface-level pattern recognition to exhibit deeper inferential competence," they write.
I'm far from an expert on this kind of AI, but my impression is that it imitates writing about reasoning rather than actually reasoning.
Notes:
Disclaimer: I don't necessarily agree with everything in the above articles, but I think they are worth reading. I have sometimes suppressed paragraphing or rearranged the paragraphs in the excerpts to make a point.
A couple of years ago, Governor Ron DeSantis claimed that crime in his state of Florida was at a fifty-year low while "major" crime in New York City had increased by 23% the previous year[1]. Now, this is not a fact check but a logic check, so I'm just going to assume that the statistics given by DeSantis and others quoted in this entry are factually correct. Instead of fact-checking these statistics, the question I'm addressing is: What, if anything, do they prove?
Some critics of DeSantis replied that the homicide rate in Jacksonville, Florida was actually three times greater than that in the Big Apple[2]: specifically, that the homicide rate per 100K in 2022 was 16.7 in Jacksonville but only 4.8 in New York City. Of course, both of these sets of statistics can be correct: it's quite possible that crime was decreasing in Florida and increasing in New York as DeSantis claimed, but was worse in Florida than in New York as his critics claimed. But even if the statistics are correct, the governor could rightfully be criticized for cherry-picking the ones that made his state look good.
A defender of DeSantis rebutted the critics by citing the number of murders per square mile in 2022 in Jacksonville (0.19) and New York City (1.38)[3]. This is a statistic of dubious value in comparing the amount of murder in two places, since it's affected by population density: the higher the density, the more murders per square mile. New York no doubt has much greater population density than Jacksonville. Moreover, this particular comparison is affected by a piece of trivia appropriate for a Ripley's cartoon[4] or the Guinness Book of World Records.
What is the largest city in area in the contiguous United States, that is, the "lower 48"? This is a trivia question rather than a logic puzzle, so you either know the answer or you can look it up, but you can't figure it out. You might guess that it's Los Angeles, a notoriously spread-out city, but that's wrong. Do you give up? The answer is Jacksonville, Florida[5].
So, even if it made sense to compare cities on the basis of murders per square mile, it wouldn't be fair to compare New York City to Jacksonville, given that the latter is the largest city in area in the lower forty-eight, but only the eleventh in population size[6].
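For anyone who wants to check the arithmetic behind that point, here's a back-of-the-envelope sketch. The population and land-area figures are rough approximations of my own, supplied only for illustration, so the results merely come close to the 0.19 and 1.38 cited above:

```python
# city: (homicide rate per 100,000, approx. population, approx. area in sq. mi.)
cities = {
    "Jacksonville":  (16.7,   950_000, 875),
    "New York City": ( 4.8, 8_500_000, 300),
}

for name, (rate_per_100k, population, area) in cities.items():
    murders = rate_per_100k * population / 100_000
    per_square_mile = murders / area
    density = population / area
    print(f"{name}: about {murders:.0f} murders, "
          f"{per_square_mile:.2f} per square mile, "
          f"density {density:,.0f} people per square mile")
```

The per-square-mile figure is just the per-capita rate multiplied by population density, which is why the much denser city comes out looking far worse by that measure even though its per-capita rate is far lower.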
Despite the title, I'm not ready to add an entry to the files for statistical fallacies that take advantage of Jacksonville's trivia-worthy status as the lower 48's biggest city in area. However, I've now come across two examples, and if I find one more, I may just do so.
Notes:
In previous lessons[1], we saw how Venn diagrams are used to represent logical relations between classes. However, as pointed out previously, Venn's diagrams are limited to representing the relations among at most three classes. There are extensions of Venn's diagrams, but they become increasingly awkward as the number of class terms increases. One way to work around this problem when faced with polysyllogisms―that is, categorical arguments involving four or more class terms―was explained in Lesson 19, namely, breaking such arguments down into a chain of categorical syllogisms.
As I mentioned in the previous lesson, the technique of turning a complex argument into a chain of simpler ones can show that the argument is valid but not that it's invalid. This is because that technique is a method of proof, and it's a general fact that the failure of a given attempt to prove a conclusion doesn't mean that no other attempt would succeed. In contrast, a Venn diagram shows an argument to be either valid or invalid. For this reason, it would be nice to have such a diagrammatic technique for polysyllogisms.
Prior to John Venn, Leonhard Euler used circles to represent the logical relationships between classes[2]. In my opinion, Euler's diagrams for the universal statements of categorical logic are more intuitive than Venn's, but unfortunately those for the particular statements were neither intuitive nor useful. This problem led Venn to keep the circles but take a different approach to representing all types of categorical statement, which is a shame given the limitations of his approach both in intuitiveness and in the number of terms that can be diagrammed.
In this lesson, I will simply introduce Euler's diagrams and show how they are used to represent the logical content of universal statements; in a future lesson, we'll see how to use them to evaluate categorical arguments.
Euler did not have anything corresponding to Venn's primary diagrams[3], which divide up all of the logical space of the diagram into every possible subclass of two or three classes. Instead of using shading to show that certain classes were empty, Euler used the spatial relationship between the circles themselves to indicate such relationships. So, here's how Euler represented universal affirmative statements―that is, A statements: he drew the circle representing the subject class entirely inside the circle representing the predicate class.
Similarly, to represent universal negative statements―that is, E statements―Euler drew the circles so that they did not overlap. In my view, these diagrams are more intuitive representations of these categorical relationships than the corresponding Venn diagrams, since you can see that one class is contained within another or that the two classes are disjoint.
Since Euler's methods of representing particular statements were inadequate and not as intuitive as those for universal ones, we can adopt the convention of placing a mark inside a class or subclass to indicate that it is non-empty.
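For readers who find code more congenial than circles, here's a small sketch in Python of the same logical content, with sets standing in for Euler's circles and made-up classes as examples:

```python
whales  = {"blue whale", "orca", "humpback"}
mammals = {"blue whale", "orca", "humpback", "dog", "cat"}
fish    = {"salmon", "tuna", "shark"}

# A statement ("All whales are mammals"): one class is contained in the other,
# just as Euler draws one circle inside another.
print(whales <= mammals)          # True

# E statement ("No whales are fish"): the classes are disjoint,
# just as Euler draws non-overlapping circles.
print(whales.isdisjoint(fish))    # True

# I statement ("Some mammals are whales"): the overlap is non-empty,
# the relation we mark with an "x" under the convention just adopted.
print(len(mammals & whales) > 0)  # True
```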
In the next lesson, we'll start looking at how to use this combination of Euler and Venn diagrams to evaluate categorical arguments.
Notes:
It's time once again to play America's favorite game: Name that fallacy! Here's how it's played, in case you haven't played before: I will show you a passage from a written work and your task is to identify the logical fallacy committed. It's not so important whether you can actually put a name to it, so long as you recognize the nature of the mistake. So, let's get started.
What logical fallacy is committed by the following passage?
Should it surprise anyone that alcohol abuse in America has been rising, not falling? The number of adults who either consume too much alcohol or have an outright dependency on it rose from 13.8 million in 1992 to 17.6 million in 2002, according to the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC).[1]
Explanation: The quoted passage occurs in a book criticizing the "self-help movement" as a failure or even counter-productive. So, its point seems to be that, despite the efforts of Alcoholics Anonymous and other self-help organizations, alcohol abuse had actually increased in the period of time prior to the book's publication.
The two statistics cited, taken from the NESARC survey, indicate that the number of Americans who drink too much alcohol increased by 3.8 million from 1992 to 2002[2]. This would be a whopping 27.5% increase in only slightly more than a decade. However, this increase is based on comparing the absolute numbers of those who drank too much in the two years a decade apart, while the population increased in the intervening years. As a result, the rise in population would be expected to increase the number of over-drinkers: a rising tide raises all boats. So, one could answer the author's question: "No, it should surprise no one that alcohol abuse in America rose in that decade since the population increased during it."
One way to take the increase in population into account is to calculate the fraction of the population estimated to have a drinking problem for each of the two years so that the population is accounted for in the denominator. Then these fractions can be turned into percentages by multiplying them by 100.
How much did the population increase? In 1992, the population was 260 million[3] but it had increased to 287 million by 2002[4]. This is slightly more than a ten percent increase, so population growth alone would account for the number of problem drinkers increasing by roughly 1.4 million to a little more than 15 million in 2002, which leaves an increase of about 2.4 million unaccounted for.
Doing the math, 5.3% of the population drank too much in 1992, whereas in 2002 it had increased to 6.1%[5], an increase of 0.8 of a percentage point. So, it appears that the number of Americans who abuse alcohol may have increased in the decade in question, but citing the absolute numbers while ignoring the increase in population exaggerates that increase. The author could have made his case without exaggeration, at least in the case of alcohol abuse, by pointing to the fact that the two statistics showed no decline, and perhaps even an increase, in such behavior.
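For those who want to check my arithmetic, here's the calculation in Python, using the figures cited above:

```python
drinkers_1992, drinkers_2002     = 13.8e6, 17.6e6
population_1992, population_2002 = 260e6, 287e6

raw_increase = (drinkers_2002 - drinkers_1992) / drinkers_1992
rate_1992 = drinkers_1992 / population_1992
rate_2002 = drinkers_2002 / population_2002

print(f"Raw increase in problem drinkers: {raw_increase:.1%}")             # about 27.5%
print(f"Share of population in 1992: {rate_1992:.1%}")                     # about 5.3%
print(f"Share of population in 2002: {rate_2002:.1%}")                     # about 6.1%
print(f"Increase: {(rate_2002 - rate_1992) * 100:.1f} percentage points")  # about 0.8
```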
Notes:
The combination of a lock is four digits long and each digit is unique, that is, each occurs only once in the combination. The following are some incorrect combinations.
Can you determine the correct combination from the above clues?
It might help to start with the third clue.
8 7 6 5
Previous "Crack the Combination" puzzles: I, II, III, IV, V, VI, VII, VIII, IX.
The following short video is misleadingly named: it won't teach you much about how to spot fake photographs generated by so-called artificial intelligence, though you might be able to use the geometric techniques briefly demonstrated. Instead, it shows why spotting phony photos and videos is becoming a task for professionals. The presenter is the author of a book―not for amateurs―on analyzing and exposing faked photography[1], which contains more detail on the techniques for analyzing perspective and lighting. By the way, I quite agree with the presenter's comments on anti-social media near the end of the video.
Though fake photography and videography are becoming increasingly sophisticated, there are still many photos and videos in the media that use techniques that have been around since the beginnings of photography, such as selective editing[2], false captioning[3], or cropping[4]. The following article describes a couple of recent cases. Warning: the photos are unpleasant to look at, which is part of their effectiveness as propaganda.
Notes:
Disclaimer: I don't necessarily agree with everything in the video and article, but I think they are worth watching or reading in their entirety.