The diminishing returns on data

CNet’s Tom Krazit has posted a brief but very interesting interview with the Berkeley economist Hal Varian, who now serves as one of Google’s big thinkers. Krazit asks Varian whether search scale offers a quality advantage – in other words, does the ability to collect and analyze more data on more searches translate into better search results and better search-linked ads. Here’s the exchange:

Krazit: One thing we’ve been talking about over the last two weeks is scale in search and search advertising. Is there a point at which it doesn’t matter whether you have more market share in looking to make your product better?

Varian: Absolutely. We’re very skeptical about the scale argument, as you might expect. There’s a lot of aspects to this subject that are not very well understood.

On this data issue, people keep talking about how more data gives you a bigger advantage. But when you look at data, there’s a small statistical point that the accuracy with which you can measure things as they go up is the square root of the sample size. So there’s a kind of natural diminishing returns to scale just because of statistics: you have to have four times as big a sample to get twice as good an estimate.

Another point that I think is very important to remember … query traffic is growing at over 40 percent a year. If you have something that is growing at 40 percent a year, that means it doubles in two years.

So the amount of traffic that Yahoo, say, has now is about what Google had two years ago. So where’s this scale business? I mean, this is kind of crazy.

The other thing is, when we do improvements at Google, everything we do essentially is tested on a 1 percent or 0.5 percent experiment to see whether it’s really offering an improvement. So, if you’re half the size, well, you run a 2 percent experiment.

So in all of this stuff, the scale arguments are pretty bogus in our view…
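Two pieces of arithmetic in Varian's answer are easy to check. The sketch below is an illustration only, not something from the interview: the function names are hypothetical, and it assumes a constant annual growth rate and experiments sized purely by traffic share.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def experiment_share(reference_share: float, relative_size: float) -> float:
    """Traffic share a smaller engine needs to match a larger engine's sample.

    reference_share: fraction of traffic the larger engine puts in the test.
    relative_size: the smaller engine's traffic as a fraction of the larger's.
    (Both parameters are illustrative assumptions, not figures from Google.)
    """
    return reference_share / relative_size


# 40 percent annual growth compounds to 1.4 * 1.4 = 1.96x over two years,
# so "doubles in two years" is a close approximation rather than exact.
two_year_multiple = 1.4 ** 2
years_to_double = doubling_time_years(0.40)

# An engine half the size needs a 2 percent experiment to see as many
# queries as a 1 percent experiment on the larger engine.
share_needed = experiment_share(0.01, 0.5)

print(f"two-year multiple: {two_year_multiple:.2f}")
print(f"years to double:   {years_to_double:.2f}")
print(f"share needed:      {share_needed:.0%}")
```

The doubling time comes out slightly above two years, which is consistent with Varian's rounding.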

This surprised me because there’s a fairly widespread assumption out there that Google’s search scale is an important source of its competitive advantage. Varian seems to be talking only about the effects of data scale on the quality of results and ads (there are other possible scale advantages, such as the efficiency of the underlying computing infrastructure), but if he’s right that Google long ago hit the point of diminishing returns on data, that’s going to require some rethinking of a few basic orthodoxies about competition on the web.

I was reminded, in particular, of one of Tim O’Reilly’s fundamental beliefs about the business implications of Web 2.0: that a company’s scale of data aggregation is crucial to its competitive success. As he recently wrote: “Understanding the dynamics of increasing returns on the web is the essence of what I called Web 2.0. Ultimately, on the network, applications win if they get better the more people use them. As I pointed out back in 2005, Google, Amazon, ebay, craigslist, wikipedia, and all other Web 2.0 superstar applications have this in common.” (The italics are O’Reilly’s.)

I had previously taken issue with O’Reilly’s argument that Google’s search business is characterized by a strong network effect, which I think is wrong. But Varian’s argument goes much further than that. He’s saying that the assumption of an increasing returns dynamic in data collection – what O’Reilly calls “the essence” of Web 2.0 – is “pretty bogus.” The benefit from aggregating data is actually subject to decreasing returns, thanks to the laws of statistics.
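The statistical point behind "four times as big a sample to get twice as good an estimate" is the standard error of a sample mean, which shrinks with the square root of the sample size. Here is a minimal sketch, assuming independent observations with a fixed standard deviation; the function names and sample sizes are made up for illustration, and it says nothing about how a search engine actually uses its data.

```python
import math


def standard_error(std_dev: float, sample_size: int) -> float:
    """Standard error of a sample mean: std_dev / sqrt(sample_size)."""
    return std_dev / math.sqrt(sample_size)


def sample_multiple_for(precision_gain: float) -> float:
    """How many times more data is needed to shrink the error by precision_gain."""
    return precision_gain ** 2


# Hypothetical sample sizes, chosen only to make the ratio easy to read.
base = standard_error(1.0, 10_000)
quadrupled = standard_error(1.0, 40_000)

# Four times the sample halves the error ...
error_ratio = base / quadrupled
# ... and a tenfold gain in precision takes a hundred times the data.
multiple_for_tenfold = sample_multiple_for(10)

print(error_ratio)
print(multiple_for_tenfold)
```

Under these assumptions, each further doubling of precision costs four times as much data as the last, which is the decreasing-returns pattern Varian describes.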

That doesn’t mean that data scale wasn’t once crucial to the quality of Google’s search results. The company certainly needed a critical mass of data – on links, on user behavior, etc. – to run the analyses necessary to deliver relevant results. It does mean that the advantages of data scale seem to go away pretty quickly – and at that point what determines competitive advantage is smarter algorithms (i.e., better ideas), not more data.

14 thoughts on “The diminishing returns on data”

  1. David Evans

    People do tend to forget that at some point you hit these effects – entropy, diminishing returns, or similar. “The only people who think you can have infinite growth in a finite world are madmen and economists”. I was interested to see an article in the Guardian that, in the print version, had a plot of Wikipedia articles over time that showed a standard s-curve of growth. Everything will at some point hit a constraint – it’s just the way the universe is. There is, however, something about human perceptions that leads us to think this is not the case. We assume that massive scientific progress will lead to flying cars and elimination of poverty or the Internet will put the sum of human knowledge at the fingertips of every child. Instead we get newer, faster computers that can do more…but somehow seem to take longer….!

  2. Evan Goer

    Varian is right, there’s plenty of data for running search algo bucket tests.

    The part of the story he’s leaving out is advertising inventory.

  3. Carlos Leyva

    I think that this scale argument is right on point. At a certain point how much data you have doesn’t really matter any more. There are limits to what “raw search technology” can do and Google, being the math heads that they are, have realized that a long time ago.

    Essentially, this shifts the competition to a different space, not just for Google but for lots of other players. Search is part of the solution but only a part. Finding ways to add value to data opens up huge spaces across any number of verticals.

    It is still early in this game, like maybe 1990 in a PC history time frame. Much of what is interesting will NOT happen directly within the tech sector itself. It is now all about the use of information and not its acquisition.

  4. Seth Finkelstein

    Well … yes and no.

    I mean, it’s obviously true that there’s diminishing returns to any data analysis. That’s trivial, even if some technohucksters flack the opposite. But …

    You’ve got to remember Google’s political context here of “We’re not a monopoly, nope, nope, no monopoly issues due to our size, not at all, IT’S SCIENCE, so the anti-trust regulators shouldn’t be looking at us, pay no attention to the market share …”.

    The tip-off is in his remark “So the amount of traffic that Yahoo, say, has now is about what Google had two years ago. So where’s this scale business? I mean, this is kind of crazy”

    That’s basically trying to convey Google doesn’t have an antitrust relevant advantage over Yahoo due to relevant size. That’s a different concept.

  5. David Evans

    Seth, that’s confusing a technical advantage with a market advantage. The issue at hand is whether the quality of Google’s search is meaningfully higher because of the scale of their data. The market power they have because of their customer base, eyeballs and adjacent web services is something else. Combining this with other discussions about how data centre real estate investments could become a drag just means that Google could be overtaken technically more easily than we might otherwise realise. However, history shows the relative technical merit of an offering to be barely of consequence (within limits) when it comes to use and abuse of significant market power.

  6. Tom Lord

    I am not certain but I think that there is a pretty disturbing way in which you are dead wrong.

    I have no trouble at all believing that search result quality grows, let’s say, logarithmically with scale. Makes perfect intuitive sense.

    Ads are a different matter and I call your attention back to Google’s acknowledgment that they are going into the “behavioral tracking” business.

    Behavioral tracking means, among a lot of other unpleasant things, you and I see different ads in any context where the tracking is applied to place ads. You and I both sit across the room from a virtualized and malevolent version of B. F. Skinner and we are both treated to the particular stimuli that the numbers show are most likely to provoke a conditioned response in us. The “behavioral tracking ad placement” biz is essentially a race to find the most addictive personally tailored algorithm for influencing our behaviors.

    Assuming that the personally tailored ad placements that result from behavioral tracking are actually better than other ad placement techniques, then the value of search grows *at* *least* *linearly* with scale.

    What Varian is saying here suggests that Google thinks behavioral-based ad placement grows in value at (rule of thumb) some linear-times-a-logarithm rate. So, twice as many users means the value is 2*log(2).


  7. Tony Healy

    Varian’s response is disingenuous. The search market is not homogenous, but consists of hundreds of thousands of niches, for some of which data might be sparse.

    In those cases, Google’s access to a larger total results pool can be the difference between successfully analysing a lucrative niche and being swamped in noisy data.

    Secondly, there’s opportunity in identifying new trends quickly and handling them. Google’s greater market reach means it gains earlier insights into niches and trends, and that can translate into competitive advantage.

  8. YankeeSoccerNut

    Healy’s point is spot on IMHO. Growth comes from identifying and capitalizing on niche opportunities quicker than the competition. Gain a reputation for providing this capability better than your competition (e.g. Google v. Yahoo/MSFT) and you win a larger portion of the advertising dollars — starving out the competition and fueling your own growth. You get more search traffic enabling you to target new niches and the cycle continues…

  9. Linuxguru1968

    It’s kind of odd to hear an economist discussing the graph of amount of data vs. market share as a single parabolic relationship reaching some asymptotic line. Most economists think in terms of cyclical relationships like sine waves. You would expect that scale vs. market would be a cyclical phenomenon as well. I just can’t believe that it is as static as he leads us to believe.

    Even as raw byte count increases, the ability to store it far outstrips the rate at which it is created, a negative factor affecting raw scale. As algorithms become more sophisticated over time, large chunks of bytes become simplified and re-categorized into more manageable formats, another negative factor in the equation. Rather than just a square function, you would expect the actual relationship to be a complicated cyclical function over time.

    Somewhere in all the data of search, purchases, page hits, etc. must be a multitude of cyclical events related to social trends that would be of interest to any company wanting to boost market share. I think Mr. Varian is hanging out too much with the wire heads at Google!

  10. Joris van Hoboken

    I think Seth Finkelstein is right. It’s antitrust PR. I would add that Hal Varian is not an economist but a Google employee when he speaks about the economics of search.

    Almost everything at Google, and innovation in particular, is data driven. It’s pretty obvious that scale matters in many ways. Varian’s arguments, quoted above, are not even denying that. They state: 1. Scale matters but not as much as they say it does. 2. Scale matters but others are also growing. 3. Scale matters but there are also fish in a small pond.

    I agree that there might be a lot of other factors which contribute to Google’s market share of users, such as their brand, their advertiser network, their understanding of search as a product, the quality of the global technical infrastructure their product is built on, the quality of their employees, etc., etc. (Note that many of these factors are also positively related to their market share of users.)

    But Google has never released an independent study of the use and value of user data for the improvement of search quality (such as the Tuzhilin report on click fraud). Until they do that, possibly forced by a judge or a government agency, I do not think this particular discussion will reach a satisfactory conclusion.

  11. Nicholas Valley

    I agree that Yahoo is not good enough. The current issue is whether the quality of Google’s search is totally reliable, because its large data store also contains spam. Combining this with other discussions about how data centre real estate investments could become a drag just means that Google could be overtaken technically more easily than we might think.

    Nicholas Valley

  12. Jim Mason

    A few things to note –

    1. More relevant than the square root concept are these two facts

    a) Google isn’t using all of its data to fine tune their algorithm.

    b) Yahoo is where Google was 2 years ago in terms of query volume.

    So Yahoo has way more than all the data it needs.

    2. More importantly: it is not an undisputed fact that Google’s search results are better than its rivals’. It is a branding game now and a matter of psychology. There are years of confidence that Google has instilled into the minds of people. Anyone who switches their search engine today will be going crazy with the voice in the back of their head telling them that Google may have better results.

    Grass is greener on the other side in general but this is a one way Google effect. It’s like preferring expensive wine even if you know for sure a cheaper wine tasted better the last time you tried it.

    Here’s my claim – even if Google completely stops working on improving its search results, its market dominance will continue to grow for quite a while.
