# The wisdom of statistically manipulated crowds

The wisdom of a crowd is often in the eye of the beholder, but most of us understand that, at its most basic level, “crowd wisdom” refers to a fairly simple phenomenon: when you ask a whole bunch of random people a question that can be answered with a number (eg, what’s the population of Swaziland?) and then you add up all the answers and divide the sum by the number of people providing those answers – ie, calculate the average – you’ll frequently get a close approximation of the actual answer. Indeed, it’s often suggested, the crowd’s average answer tends to be more accurate than an estimate from an actual expert. As the science writer Jonah Lehrer put it in a column in the Wall Street Journal on Saturday:

The good news is that the wisdom of crowds exists. When groups of people are asked a difficult question – say, to estimate the number of marbles in a jar, or the murder rate of New York City – their mistakes tend to cancel each other out. As a result, the average answer is often surprisingly accurate.

To back this up, Lehrer points to a new study by a group of Swiss researchers:

The researchers gathered 144 Swiss college students, sat them in isolated cubicles, and then asked them to answer [six] questions, such as the number of new immigrants living in Zurich. In many instances, the crowd proved correct. When asked about those immigrants, for instance, the median guess of the students was 10,000. The answer was 10,067.

Neat, huh?

Except, well, it’s not quite that clear-cut. In fact, it’s not clear-cut at all. If you read the paper, you’ll find that the crowd did not “prove correct” in many instances. The only time the crowd proved even close to correct was in the particular instance cited by Lehrer – and that was only because Lehrer used the median answer rather than the mean. In most cases, the average answer provided by the crowd was wildly wrong.

Peter Freed, a neuroscience researcher at Columbia, let loose on Lehrer in a long, amusing blog post, arguing that he (Lehrer) had misread the evidence in the study. Freed pointed out that if you look at the crowd’s average answers – “average” as in “mean” – to the six questions the researchers posed, you’ll find that they are, as Freed says, “horrrrrrrrrrrrrendous”:

… the crowd was hundreds of percents – yes, hundreds of percents – off the mark. They were less than 100% off in response to only one out of the six questions! At their worst – to take a single value, as Lehrer wrongly did with the 0.7% [median] – the 144 Swiss students, as a true crowd (unlike the 0.7%), guessed that there had been 135,051 assaults in 2006 in Switzerland – in fact there had been 9,272 – an error of 1,356%.
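The two figures in play here are easy to check for yourself. The snippet below is only a sketch: the `percent_error` helper is my own name, not anything from the paper, and it simply measures the miss as a percentage of the true value, using the numbers quoted above.

```python
def percent_error(estimate: float, truth: float) -> float:
    """Absolute error expressed as a percentage of the true value."""
    return abs(estimate - truth) / truth * 100


# Mean guess vs. true value for assaults, as quoted by Freed.
assault_error = percent_error(135_051, 9_272)

# Median guess vs. true value for the Zurich question, as quoted by Lehrer.
zurich_error = percent_error(10_000, 10_067)

print(f"assaults (mean guess):  {assault_error:.1f}%")
print(f"Zurich (median guess):  {zurich_error:.1f}%")
```

The first comes out a little over 1,356% and the second a little under 0.7%, which matches the numbers both writers are arguing over.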

Or, as the researchers themselves report:

In our case, the arithmetic mean performs poorly, as we have validated by comparing its distance to the truth with the individual distances to the truth. In only 21.3% of the cases is the arithmetic mean closer to the truth than the individual first estimates.
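To make that comparison concrete, here is one plausible way to score a crowd's average against its own members. It is a sketch, not the researchers' procedure: the guesses are invented, the true value of 200 is invented, and `share_beaten` is a hypothetical helper name.

```python
from statistics import mean


def share_beaten(guesses, truth, aggregate):
    """Fraction of individual guesses that land farther from the truth than the aggregate does."""
    aggregate_distance = abs(aggregate - truth)
    beaten = sum(1 for g in guesses if abs(g - truth) > aggregate_distance)
    return beaten / len(guesses)


# Invented data: most guesses are in the ballpark, a couple are wildly high.
truth = 200
guesses = [50, 100, 150, 180, 220, 300, 400, 1_000, 5_000, 50_000]

arithmetic = mean(guesses)
print("arithmetic mean:", arithmetic)
print("share of individuals the mean beats:", share_beaten(guesses, truth, arithmetic))
```

With these made-up numbers the mean beats only the single wildest guesser, which is the flavor of result the researchers are describing.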

So, far from providing evidence that supports the existence of the wisdom-of-crowds effect, the study actually suggests that the effect may not be real at all, or at least may be a much rarer phenomenon than we assume.

But since this is statistics, that’s by no means (no pun intended) the end of the story. As the researchers go on to explain, it’s quite natural for a crowd’s average answer, calculated as the mean, to be way too high – and hence ridiculously unwise. That’s because, while individuals’ underestimates for these kinds of questions are bounded at zero, there’s no upper bound to their overestimates. “In other words,” as the researchers write, “a minority of estimates are scattered in a fat right tail,” which ends up skewing the mean far beyond any semblance of “wisdom.”
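A bit of arithmetic shows how lopsided this is. The sketch below uses the assaults figure quoted earlier as the truth and a hypothetical overestimate of one million, picked purely for illustration.

```python
import math

truth = 9_272            # true number of assaults, as quoted above
wild_guess = 1_000_000   # hypothetical overestimate, chosen only for illustration

# The lowest possible guess is 0, so no underestimate can miss by more than `truth`.
max_under_error = truth - 0
over_error = wild_guess - truth

# Number of respondents guessing 0 needed to drag the mean of the group
# (them plus the one wild guesser) back down to the truth or below.
zero_guessers_needed = math.ceil(over_error / max_under_error)
print(zero_guessers_needed)
```

In other words, a single person with a sufficiently inflated guess can outweigh a whole roomful of the most extreme underestimators, and nobody actually guesses zero.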

Fortunately (or not), the arcane art of statistics allows you to correct for the crowd’s errors. By massaging the results – “tuning” them, as the researchers put it – you can effectively weed out the overestimates and (presto-chango) manufacture a wisdom-of-crowds effect. In this case, the researchers performed this magic by calculating the “geometric mean” of the group’s answers rather than the simple “arithmetic mean”:

As a large number of our subjects had problems choosing the right order of magnitude of their responses, they faced a problem of logarithmic nature. When using logarithms of estimates, the arithmetic mean is closer to the logarithm of the truth than the individuals’ estimates in 77.1% of the cases. This confirms that the geometric mean (i.e., exponential of the mean of the logarithmized data) is an accurate measure of the wisdom of crowds for our data.
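For anyone who didn't quite get that, the parenthetical definition translates directly into a few lines of Python. The guesses here are hypothetical, spread over several orders of magnitude; `statistics.geometric_mean` is the standard-library function (Python 3.8 and later) and is used only as a cross-check.

```python
import math
from statistics import geometric_mean, mean

# Hypothetical guesses spanning several orders of magnitude.
guesses = [100, 1_000, 1_000, 10_000, 1_000_000]

# "Exponential of the mean of the logarithmized data."
log_mean = mean([math.log(g) for g in guesses])
gm_by_hand = math.exp(log_mean)

print("arithmetic mean:", mean(guesses))
print("geometric mean (by hand):", gm_by_hand)
print("geometric mean (stdlib): ", geometric_mean(guesses))
```

Every guess still contributes, but it contributes its logarithm, so the lone guess of a million moves the result far less than it moves the ordinary average.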

Got that?

Well, it further turns out that the median answer – the centermost individual answer – in a big set of answers often replicates, roughly, the geometric mean. Again, that’s no big surprise. The median, like the geometric mean, serves to neutralize wildly wrong guesses. It hides the magnitude of people’s errors. The researchers point this fact out in their paper, but Freed, having criticized Lehrer for a sloppy reading of the study, seems to have overlooked that point. Which earns Freed a righteous tongue-lashing from another blogger, the physics professor Chad Orzel:

Freed’s proud ignorance of the underlying statistics completely undermines everything else. His core argument is that the “wisdom of crowds” effect is bunk because the arithmetic mean of the guesses is a lousy estimate of the real value. Which is not surprising, given the nature of the distribution – that’s why the authors prefer the geometric mean. He blasts Lehrer for using a median value as his example, without noting that the median values are generally pretty close to the geometric means – all but one are within 20% of the geometric mean – making the median a not-too-bad (and much easier to explain) characterization of the distribution.
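Orzel's observation about medians and geometric means is easy to reproduce on toy data. The guesses below are invented and deliberately spaced so that they are roughly symmetric on a logarithmic scale; that symmetry is an assumption of the sketch, and real responses need not behave so neatly.

```python
from statistics import geometric_mean, mean, median

# Invented guesses, roughly evenly spaced on a log scale.
guesses = [1_000, 3_000, 10_000, 30_000, 100_000]

med = median(guesses)
gm = geometric_mean(guesses)
am = mean(guesses)

# Relative gap between the median and the geometric mean.
gap = abs(med - gm) / gm

print("median:", med)
print("geometric mean:", round(gm))
print("arithmetic mean:", am)
print(f"median vs. geometric mean gap: {gap:.1%}")
```

On data shaped like this, the median and the geometric mean land within a few percent of each other while the arithmetic mean sits several times higher.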

You get the sense that this could go on forever. And I sort of hope it does, because I enjoyed Lehrer’s original column (the main point of which, by the way, was that the more a crowd socializes the less “wise” it becomes), and I enjoyed Freed’s vigorous debunking of Lehrer’s reading of (one part of) the study, and I also enjoyed Orzel’s equally vigorous debunking of (one part of) Freed’s debunking.

But beyond the points and counterpoints, there is a big picture here, and it can be described this way: Even in its most basic expression, the wisdom-of-crowds effect seems to be exaggerated. In many cases, including the ones covered by the Swiss researchers, it’s only by using a statistical trick that you can nudge a crowd’s responses toward accuracy. By looking at the geometric mean rather than the simple arithmetic mean, the researchers performed the statistical equivalent of cosmetic surgery on the crowd: they snipped away those responses that didn’t fit the theoretical wisdom-of-crowds effect that they wanted to display. As soon as you start massaging the answers of a crowd in a way that gives more weight to some answers and less weight to other answers, you’re no longer dealing with a true crowd, a real writhing mass of humanity. You’re dealing with a statistical fiction. You’re dealing, in other words, not with the wisdom of crowds, but with the wisdom of statisticians. There’s absolutely nothing wrong with that – from a purely statistical perspective, it’s the right thing to do – but you shouldn’t then pretend that you’re documenting a real-world phenomenon.

Freed gets at this point in a comment he makes on Orzel’s post:

Statistics’ dislike of long right tails is *not a scientific position.* It is an aesthetic position that, at least personally, I find robs us of a great deal of psychological richness … [T]o understand the behavior of a crowd – a real world crowd, not a group of prisoners in segregation – or of society in general, right tails matter, and extreme opinions are over-weighted.

The next time somebody tells you about a wisdom-of-crowds effect, make sure you ask them whether they’re talking about a real crowd or a statistically enhanced crowd.

## 44 thoughts on “The wisdom of statistically manipulated crowds”

1. Nick Carr

But, it doesn’t necessarily follow from this that we shouldn’t look for wisdom-of-crowd effects through the geometric mean or whatever else might work.

Again, I agree entirely.

But let’s also recognize that at that point “wisdom of crowd” becomes, as Philip Klop put it in an earlier comment, “just a catchy label.”

Nick

2. Seth Finkelstein

Nick, sorry, no distortion was intended. We really seem to be approaching this topic from opposite directions. I’m having a very hard time grasping your objection, wrapping my mind around why for you the arithmetic mean or median is fine but the geometric mean is somehow in a different category.

The following is what I PERCEIVE you as saying, my attempt at understanding your objection from your further comments – that in order for a term like “wisdom of crowds” to be valid, one must first proceed from a political conception of the crowd, and only use mathematics in the service of that concept of social relations – it’s not OK to use a mathematical model _a priori_, that’s “cheating” unless you can justify it by a political framework.

That is not how the mathematical-background people are thinking of it. There, “Wisdom of crowds” is a catchy label at *all* points, for “signal-processing/data-analysis”.

This is a bit like the difference between evolution and Social Darwinism.

3. Kevin Kelly

Nick,

“If you went out and posed these questions to experts in the relevant disciplines – Swiss demographics, European geography, and Swiss crime – I’m going to guess the experts would kick the crowd’s collective ass, using whatever statistical measure you want to apply to the crowd’s answers (arithmetic mean, geometric mean, or median). Anybody want to disagree with that?”

I do.

The six questions you list:

1. What is the population density in Switzerland in inhabitants per square kilometer?

2. What is the length of the border between Switzerland and Italy in kilometers?

3. How many more inhabitants did Zurich gain in 2006?

4. How many murders were officially registered in Switzerland in 2006?

5. How many rapes were officially registered in Switzerland in 2006?

6. How many assaults were officially registered in Switzerland in 2006?

…would each require statistical analysis of various sorts to come up with a single figure answer. There would be definitional and measurement issues and disagreements. The methods used would be different than the ones employed by the “wisdom” of the crowd, but whether you “believed” them would eventually rest on whether you agreed with their assumptions. And from my own experience with “experts” there is no promise that the variability between their different answers would be less than one gets from the crowd. In other words the “wisdom of a crowd of experts” must also go through a statistical process. Is it more trustworthy than the other process? Depends.

4. Jerzy Wieczorek

Nick: “What version of the real world is the geometric mean giving us?”

I hope I can illustrate a case when the geometric mean can be even MORE “real-world” than the arithmetic mean.

If you’re asking a question where all respondents know the order of magnitude (tens, hundreds, thousands, millions, etc), the arithmetic mean is often fine. (Nobody’s going to guess the president’s age is under 10 or over 100.)

However, consider a question where people are really unsure about the order of magnitude. Let’s say you ask three people “What’s the population of China?” (Wikipedia says it’s about 1 billion people.) Suppose each respondent thinks, “It’s gotta be a big number, maybe a million or billion or trillion…” Then they each guess one of those, so your responses are 1,000,000; 1,000,000,000; and 1,000,000,000,000.

Now if you just take the arithmetic mean, you’ll get about 330 billion. That’s way too large.

But the people are just trying to guess the order of magnitude: is it a million or a billion or a trillion? Their *real-world* mental process is basically equivalent to guessing how many 0s there are after the 1. So it might be more appropriate to take the average of the number of 0s in their guesses instead.

If you count the number of zeros and take the arithmetic mean of that…

(6 zeros + 9 zeros + 12 zeros)/3 = 9 zeros

…and then you convert it back to the “number of people” scale, you get 9 zeros -> 1 billion, which IS the right order of magnitude. That’s not a trick or a statistical fiction — it’s just a better model of what’s going on in people’s heads.

Another example: “How many sheets of copy paper were used by our company’s office last year?” Let’s say the answer is 10,000, while the guesses are 1,000; 10,000; and 100,000. Again, the arithmetic mean of the answers is way off (at 37,000)… but taking the arithmetic mean of the number of zeros leads us to the correct 10,000.

So, when you ask the crowd a question, if you have reason to believe they’re mostly just guessing the order of magnitude, it may be better to take the arithmetic mean of the number of zeros in their answers (rather than of the answers themselves).

Statisticians have realized that this approach is useful sometimes, so they’ve found a handy formula that will calculate this (or a sensible equivalent when the numbers aren’t just 1s with trailing 0s). That is what we call the geometric mean.

It’s not a case of “massaging” or “snipping” the numbers — it’s just closer to what actually goes through people’s heads when answering certain types of questions.
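The two worked examples in this comment can be run as written. This is a minimal sketch: `average_zeros` is a made-up helper name, and it uses base-10 logarithms so that "number of zeros" also makes sense for guesses that aren't a 1 followed by zeros.

```python
import math
from statistics import mean


def average_zeros(guesses):
    """Average the base-10 exponents of the guesses, then convert back to a count."""
    exponents = [math.log10(g) for g in guesses]
    return 10 ** mean(exponents)


china = [1_000_000, 1_000_000_000, 1_000_000_000_000]   # guesses from the comment
paper = [1_000, 10_000, 100_000]                        # guesses from the comment

for label, guesses in (("China", china), ("copy paper", paper)):
    print(label)
    print("  arithmetic mean:     ", mean(guesses))
    print("  averaging the zeros: ", average_zeros(guesses))
```

Averaging the exponents and converting back is the geometric mean under another name, so this gives the same answer as `statistics.geometric_mean` would.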

5. Nick Carr

Jerzy,

Re: “But the people are just trying to guess the order of magnitude: is it a million or a billion or a trillion? Their *real-world* mental process is basically equivalent to guessing how many 0s there are after the 1.”

Thanks. That’s a good way of explaining it.

I wonder if the effect of the technique (in some cases at least) is that it serves to select (ie, give greater weight to) those members of the crowd who are actually relatively more knowledgeable about the subject of the question (and hence whose rough guesses will generally be in the ballpark) and weed out the less knowledgeable members (who are making wild guesses). If so, one could say the technique is less about discovering the wisdom of the crowd than it is about discovering where the most real knowledge resides in a crowd.

What do you think?

Nick

6. Nick Carr

Kevin,

I’m not sure what you mean, exactly.

I was saying: go out and ask an expert in each of the relevant subjects, record his/her top-of-mind answer (ie tapping directly into their accumulated knowledge, not their research or statistical skills), and then compare each of the expert answers to the crowd averages. Which would be closer to the truth?

Nick

7. Kevin Kelly

“Which would be closer to the truth?”

In these kinds of sociological assessments the truth is statistical. Will the experts’ top-of-mind estimate be closer to the truth? That too is a probabilistic assessment whose conclusion will depend on all kinds of assumptions that can be challenged. And the bigger the pool of data points, the more statistical the truth.

The reasonable person would tend to believe or favor the expert’s biases over non experts’ — but I don’t think there is an absolute “truth” to compare it to (in these data sets). This is a form of trust more than certainty, because experts are too often wrong in their biases to be given certainty.

8. Kevin Kelly

The questions you listed are almost canonically indeterminate. For example:

1. What is the population density in Switzerland in inhabitants per square kilometer?

When? What year, what season, what day of the year? Include guest workers? Students? Tourists? No single true answer.

2. What is the length of the border between Switzerland and Italy in kilometers?

Mathematically, the length is infinite. The smaller units you measure the longer it gets. Pick your resolution and you get a different answer, although you can sum the infinite to get a useful estimate — but not a single “truth.”

5. How many rapes were officially registered in Switzerland in 2006?

Rape accusations? Proven rapes? What about those accused in 2006 but not on trial till 2007? What about those accused but charges dropped?

9. Kevin Kelly

You said, “You’re dealing, in other words, not with the wisdom of crowds, but with the wisdom of statisticians. There’s absolutely nothing wrong with that – from a purely statistical perspective, it’s the right thing to do – but you shouldn’t then pretend that you’re documenting a real-world phenomenon.”

I say that the real-world phenomenon is a purely statistical perspective — especially when talking about bunches of people’s behavior. There is not the distinction you are claiming.

10. Nick Carr

Kevin,

I don’t know whether you’re joking or not, but I think those questions are amenable to precise answers. Indeed, here are the “true values” as reported by the researchers for the questions:

184

734

10,067

198

639

9,272

These are the values I would use in making the comparisons.

As for “officially registered in Switzerland in 2006,” I believe there is considerably less ambiguity there than you are seeing, but if you can rephrase the question even more precisely, that would be great.

Nick

11. Brutus.wordpress.com

If the so-called wisdom of crowds never rose above the level of a mundane, gee-whiz observation of a statistical effect drawn from aggregation of data, few would care enough to uncover the flaws and controversy in the methodology, process, or algorithm used to retrieve results, much less the semantic inconsistencies. However, the purported effect has caught our imaginations and folks are attempting to harness it for one purpose or another. (Somebody explain to me the amazing results Google obtains via aggregation. I don’t get it.) So the “wisdom of crowds” is being used in nonnumerical contexts. Specifically, pollsters aggregate data and tell candidates and office holders how to shape their platforms and policies, and every new tech guru seeks ways to Wikify his or her product. The former is a poor proxy for public opinion or leadership, and the latter is merely labor outsourcing to the crowd. Whereas one might argue that a fair final price is settled on via auction, aggregating submissions for a marketing campaign by calling it a contest and awarding a prize to the winner is really just exploiting free labor. Sometimes the final result is superior, but as with Wikipedia, many times the results are mixed or worse.

12. Kevin Kelly

“I don’t know whether you’re joking or not, but I think those questions are amenable to precise answers. Indeed, here are the “true values” as reported by the researchers for the questions”

I am not kidding. I don’t believe the “true values” they offer are true. I would be willing to bet you could find a local expert with a value that would disagree with each of them.

I am surprised you believe they are “true.” What are you basing your faith on? Have you seen how they calculated? Or is it because they call themselves “expert”?

13. Nick Carr

What are you basing your faith on?

It’s trust, not faith. I’m assuming that the authors of the paper (who are not themselves experts in geography or criminology) consulted reliable published sources. If my life depended on the accuracy of the numbers, I’d certainly double-check them, but I’m pretty confident that what I’d discover would be consistent with what they discovered. I have no problem with the idea of “facts.”

14. Cedar Riener

This is a really fascinating and illuminating conversation (both at Freed’s place and here). There is a real thrill to seeing some of my writerly heroes duke it out. The internet may be making us stupid, but not me. :)

I wanted to add my point of view by agreeing with Jerzy, and highlighting one of his points, which I think is a key to some of the various disagreements (and also a way of showing how this conversation is about the nature of science, not just about the wisdom of crowds).

On one hand, which central tendency we use to describe a sample of people (or whatever) is arbitrary; the mean is no more simple or “true” than the median. On the other hand, the decision of which one to use is a judgment call by the scientists in question, and not at all arbitrary. It is subject to their background knowledge in the thing being measured, as well as the distribution of the data itself. If the data is skewed, the mean might not be the best description (when Bill Gates is in a room, everyone is on average a millionaire, if you are using a mean). But as Jerzy points out, this also has to do with a judgment of what is actually going on, in this case the psychology of the decision. It may turn out to be more “accurate” to use mean for some of the questions and median for others. I could see the distribution of murder estimates being much more skewed than the distribution of border estimates, for example.
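The Bill Gates point takes three lines to demonstrate. The net worths below are invented round numbers, not anyone's actual finances.

```python
from statistics import mean, median

# Invented net worths in dollars: nine ordinary people and one billionaire.
room = [50_000] * 9 + [100_000_000_000]

print("mean net worth:  ", mean(room))
print("median net worth:", median(room))
```

The mean says everyone in the room is a multimillionaire many times over; the median says the typical person has 50,000 dollars. Neither is wrong, but they answer different questions.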

Part of what I like about this conversation is that many here are getting at the idea that the decision of which central tendency to use is a function of a certain amount of expertise. Not just simple statistical know-how (although that does get you part of the way there) but also an understanding of the individual decision process.

No central tendency is, in general, truer or more accurate than any other, any more than a two party system is more accurate than a 3 party. Most scientists (especially social scientists) have to give up pretty quickly on whether they are assessing any real thing, and just try to tie their measurements to other measurements. There is no bedrock, just a web of connections. Whether it is wisdom, happiness, or intelligence, we give up on whether it is the real wisdom, and just try to find whether the particular wisdom measured here does anything when you poke it.

Anyways, thanks for the forum for this really insightful discussion. Hopefully this comment wasn’t too far below the mean? median?

15. Jerzy Wieczorek

Thanks for the comments, Nick and Cedar.

Nick: you had said using the geometric mean is “an attempt to *retrofit* a statistical technique chosen purely for its statistical efficacy back to a real-world crowd” — I hope you see that’s not the case after all.

Of course, it happens that scientists try 20 models and then choose to present the one that fits best, and this can be a big problem, i.e. bad science.

But in *this* case, there are solid a priori reasons to believe the geometric mean SHOULD fit better than the arithmetic mean… and so it’s *good* science that they include this comparison in the article.

“the skewedness of the distribution tells us something important about the wisdom of the crowd (and its limits)”

Yes — it tells us that the crowd is not “wise” on the count scale (1, 2, 3…), but they may be “wise” on the order-of-magnitude scale (10, 100, 1000…).

If that’s the case, i.e. if the data are skewed, then the arithmetic mean on the order-of-magnitude scale (i.e. the geometric mean on the count scale!) is likely to pick up any existing “wisdom”, while the arithmetic mean on the count scale is not.

“I wonder if the effect of the technique … serves to select … members of the crowd who are actually relatively more knowledgeable … and weed out the less knowledgeable members … If so, one could say the technique is less about discovering the wisdom of the crowd than it is about discovering where the most real knowledge resides in a crowd.”

No, I don’t think so… These metaphors (median as polity, mean as collective wisdom) are fine for making your audience quickly *feel* like they understand, say at a cocktail party. But they certainly aren’t good metaphors for *actual* understanding of the mean and median, and they probably aren’t good metaphors for understanding polity and collective wisdom either. The two fields are simply not doing the same things. It’s apples and oranges.

16. Jerzy Wieczorek

Kevin said: “I don’t think there is an absolute “truth” to compare it to”

Nick said: “Indeed, here are the “true values” as reported by the researchers…

I’m assuming that the authors of the paper … consulted reliable published sources…

If my life depended on the accuracy of the numbers, I’d certainly double-check them…”

I don’t think Kevin’s point is about double-checking whether the authors looked up the right table of numbers.

Rather, somebody had to make assumptions when creating that table of numbers in the first place:

1) First, these questions would have to be worded more precisely. For example, for the Swiss population density question, you HAVE to answer Kevin’s questions (on what date? including tourists? students? etc.) in order to have a single answer.

Otherwise it just isn’t a well-defined question, simple as that.

If you ask someone this ill-defined question, and they tell you a guess of 190, you can’t just look at the “184” on your list of answers and say they’re wrong.

Maybe 184 is “right” for “What was the population density of Swiss citizens residing at home, on April 1, 2010?” while 190 is “right” for “What was the pop density including all citizens, tourists, students, undocumented workers, etc., on Dec 31, 2010?”

2) As a practical matter, even if the questions are well defined, there may be randomness in the process of getting the answer, AND experts may disagree about the best way to find it.

Again with the Swiss population density: Say we’ve decided to count all people (citizens or otherwise) in Swiss territory on a certain date, and we hold a census by mail on that date.

You’ll miss the undocumented immigrants who hide from government officials; people who just forget to fill out the form; tourists without a Swiss address; etc., and you’ll overcount people who make random mistakes on their form, etc. etc. etc…

Expert statisticians designing the census may not agree on the best way to handle all of these cases. Some might choose to trust whatever forms are received; some might add phone callbacks (with their own issues); some might want to adjust statistically for groups that are known to be undercounted in censuses…

Counting populations is a mess!

Based on all of these factors (some decision-based, some random), the experts compiling this data could easily have gotten a number different from 184.

Maybe not too far different — but certainly different.

So even if the question is well-defined enough to HAVE a true answer, that answer might be impossible to get.

In short, we may not have an absolutely true number to use when comparing people’s guesses.

It’s good to be aware that these kinds of “facts” are based on a really long chain of assumptions and messy data collection. Even if nothing goes wrong and the final estimate happens to be *close* to the truth, it’s still just an estimate.

17. Sergio Montoro Ten

It does not make any sense to do these experiments. Ask people to estimate the distance from the Earth to the Moon, and you will get a very wrong answer from the crowd. It is not because, on average, they are stupid, but due to the fact that the human brain has trouble when dealing with absolute quantities and only works well when it can compare a relative figure against another. Evolution gave humankind a tool for counting up to 10. For questions whose answer involves counting beyond that upper bound, we invented computers.

18. Stewart Dinnage

No one may ever read this including Nick, but I have to take issue with one of the comments.

————–Nick said———————

I wonder if the effect of the technique (in some cases at least) is that it serves to select (ie, give greater weight to) those members of the crowd who are actually relatively more knowledgeable about the subject of the question (and hence whose rough guesses will generally be in the ballpark) and weed out the less knowledgeable members (who are making wild guesses). If so, one could say the technique is less about discovering the wisdom of the crowd than it is about discovering where the most real knowledge resides in a crowd.

What do you think?

———————————————–

This was a response to someone suggesting the issue of arithmetic mean from crowds not appearing wise may be related to problems where magnitude is not well known.

In an earlier post Nick said he understood the reason geometric means were used when analysing left- or right-skewed, non-normal distributions of choices.

If that is the case why make the statement above?

Clearly, regardless of “dimness” about magnitude, in a normal distribution of crowd choices that dimness might cancel out; it was said just a couple of posts earlier that this was understood, no?

I did like the discussion of democratic wisdom versus average wisdom. My personal thought on the wisdom of a skewed choice angle is that there is little wisdom (crowds or researchers) in summarising a crowd’s opinion as equal to an average when the left is limited but the right unbounded. Everything that ignores (or selectively remembers) this and tries to suggest it as some proof of a flaw in the concept of crowd wisdom generally comes across as biased individuals expressing opinions. :-)