Let’s start with a simple premise: The more often something happens, the more often people write about it.
Sounds reasonable, right?
Albert Saiz and Uri Simonsohn, in their article “Downloading Wisdom from Online Crowds,” demonstrate that the relative frequency of documents returned by a search engine can be a good measure of how frequently a phenomenon occurs. For example, if you want to know the relative cost of living in all U.S. cities (or the relative amount of corruption, or perhaps even how good the golfing is) then simply searching for “Dallas cost-of-living” and “San Francisco cost-of-living” may give you a great index. If it works as advertised, this is a fantastic general research tool for analysts and marketing researchers. Let’s take a look.
First, let’s start off by explaining why this is such a great tool: it gives us a new way to measure the world around us that’s much faster and cheaper than other methods, such as survey research. Using this new method, the authors were able to put together the first comprehensive measure of corruption at the city level, because previously it had been too expensive to conduct surveys or do field research in every major U.S. city. Suddenly, that information is now at our fingertips.
To get started, we’ll need a search engine that accurately estimates the number of documents returned from a given search… unfortunately, document counts from Google and Yahoo aren’t very reliable. Saiz and Simonsohn suggest Exalead as one of the most accurate search engines in terms of document counts.
I decided to take this for a spin and compare the relative corruption of Sweden and Russia, using the Exalead search engine. Although I wasn’t sure what to expect in terms of relative frequency, my search showed Russia with a corruption score roughly 2.6 times that of Sweden (0.02652 versus 0.01013), a result well within the bounds of believability.
| Country | Documents for “[country] corruption” | Documents for “[country]” | Relative frequency |
| --- | --- | --- | --- |
| Sweden | 181,145 | 17,881,209 | 181,145 / 17,881,209 = 0.01013 |
| Russia | 673,531 | 25,393,311 | 673,531 / 25,393,311 = 0.02652 |
Keep in mind, of course, that it’s the relative frequency that’s useful here… knowing that 1% of Web documents containing the keyword “Sweden” also refer to “corruption” is interesting, but its real utility comes from comparing it to something else. Overall, this looks like an incredibly interesting tool for social science researchers and just about anybody who needs to conduct market research on a small budget.
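For anyone who wants to repeat the arithmetic on their own searches, here is a minimal Python sketch. It only reuses the document counts reported above; the function and variable names are my own, and nothing here queries a search engine, so you would paste in counts by hand.

```python
# Document counts copied from the Exalead searches reported in this post.
counts = {
    "Sweden": {"with_keyword": 181_145, "total": 17_881_209},
    "Russia": {"with_keyword": 673_531, "total": 25_393_311},
}


def relative_frequency(with_keyword: int, total: int) -> float:
    """Share of documents mentioning the place that also mention the keyword."""
    if total <= 0:
        raise ValueError("total document count must be positive")
    return with_keyword / total


# One score per country, then the comparison that actually matters.
scores = {place: relative_frequency(**c) for place, c in counts.items()}
ratio = scores["Russia"] / scores["Sweden"]

for place, score in scores.items():
    print(f"{place}: {score:.5f}")
print(f"Russia / Sweden: {ratio:.2f}")
```

The individual scores mean little on their own; the ratio in the last line is the number to read.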
I first learned of this new research method listening to Dan Ariely’s podcast, Arming the Donkeys. Dr. Ariely is the author of Predictably Irrational, kind of a “Freakonomics for Business” that’s become mandatory reading for marketers everywhere. I highly recommend checking out the podcast — if you like the podcast, buy the book.
Warning: Everyone except die-hard researchers should stop reading here. I have posted some technical considerations below so that this post will serve as a good implementation reference, but otherwise I don’t recommend slogging through them.
As with any applied statistical tool, there are a number of inherent assumptions… fortunately, Saiz & Simonsohn offer a series of data checks that make it reasonably easy to apply this method:
- Do the different document queries hold the phenomenon and the keywords constant? Imagine we’re comparing the search terms “car crash” and “plane crash”: the appeal of writing about one may be significantly different from the other, and the percentage of documents about automobile accidents containing the keywords “car crash” may differ from the percentage of documents about airplane accidents containing the keywords “plane crash.” Therefore, queries should try to keep both the phenomenon and the keyword constant (for example, “Sweden corruption” and “Russia corruption”).
- Is the variable of interest a measurement of frequency? “Corruption” and “high-school dropouts” are examples of frequency measures, while “education” and “dating” don’t necessarily imply frequency interpretations.
- Is the keyword of interest used predominantly to discuss the occurrence rather than the non-occurrence of the phenomenon? For example, either an increase or a decrease in education levels can result in more documents containing the keyword “education.”
- Is the average number of documents found large enough to be driven by factors other than sampling error? Document counts as low as 50 can indicate reliable results; however, if your searches are returning fewer results than that, your results may not be reliable.
- Is the expected variance in your topic of interest high enough to overcome the sampling noise inherent in this measurement approach? For example, the variance in cancer rates across states is very low, while the variance in cancer rates across countries is relatively high.
- Does your keyword have as its primary or only meaning the occurrence of your phenomenon of interest? Keywords may have multiple meanings, which can skew your results. The keyword “Blacks” can have multiple meanings, for example, while “African Americans” is much less ambiguous.
- Does searching for your keyword also pick up other phenomena that are related to what you’re interested in measuring? For example, the tendency to write about guns may increase with both the availability of guns and the use of guns in crime. If a problem is detected, crafting a more specific search (e.g., “gun shows” or “guns NOT murder NOT crime”) should solve it.
- Are there plausible omitted variables that are correlated with the document-frequency of your topic? For example, more cosmopolitan cities may be more likely to discuss societal issues. If you believe this may be the case, you can control for the omitted variable by substituting another document-frequency search.
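
Two of the checks above are mechanical enough to sketch in Python: flagging searches that fall under the 50-document floor, and building an exclusion query in the “guns NOT murder NOT crime” style. This is only an illustration. The helper names are mine, the “Atlantis corruption” count is a made-up placeholder, and whether a given search engine accepts the NOT operator in this form is an assumption you should verify.

```python
from __future__ import annotations

MIN_DOCUMENTS = 50  # floor quoted in the checklist above


def low_count_queries(counts: dict[str, int], minimum: int = MIN_DOCUMENTS) -> list[str]:
    """Return the queries whose document count falls below the minimum."""
    return [query for query, n in counts.items() if n < minimum]


def exclusion_query(keyword: str, exclude: list[str]) -> str:
    """Build a query such as 'guns NOT murder NOT crime' (operator syntax assumed)."""
    return " ".join([keyword] + [f"NOT {term}" for term in exclude])


observed = {
    "Sweden corruption": 181_145,  # count reported in this post
    "Russia corruption": 673_531,  # count reported in this post
    "Atlantis corruption": 12,     # hypothetical placeholder, not a real search result
}

print(low_count_queries(observed))
print(exclusion_query("guns", ["murder", "crime"]))
```

The remaining checks are judgment calls about what a keyword means and what else might drive people to write about it, so they don’t reduce to a few lines of code.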