THE PSYCHOTIC STATE
Number 8                                                        April 30, 1999

Vox Populi?

There are, it is said, three categories of deception: lies, damned lies, and statistics. My current state of psychosis was brought on by the web's ability to generate so much of all three kinds. Humans have been lying since they invented speech - before, probably - but the abuse of statistics is of course a much newer phenomenon. The web is giving it a new - and baleful - boost.

Americans seem particularly in love with numbers - witness the nearly hourly public opinion polls of the President's job approval ratings, or ask the average 13-year-old baseball fan the ERA of the local team's star pitcher. I have always found this an odd contrast to our society's overall innumeracy: we have one of the worst math and science education records in the industrialized world. (I once saw a college student use a calculator to multiply a number by 10.)

The national mania to reduce everything to numbers, even meaningless ones, has given rise to a fad: putting surveys on web sites. Even some of the biggest sites do it - for example, the CNN poll, prominently displayed on their home page, in the top, expensive screen real estate. I am always fascinated by it, not because it tells me much about what the world is thinking, but because of what it tells me about CNN. For example, during the Monicagate shenanigans, the results of the polls on the CNN web site were always substantially different from those I heard on CNN and other news sources. Generally, the web site survey was more "Republican" than the scientific studies. What I deduce is that CNN was happy to leave, on a page which ostensibly covers news, an item that told the site's visitors what they wanted to hear - even if it was wrong.

Recently, Tim and Paul and the good folks of the Applied Concepts Lab came up with Agency.com's contribution to this trend - our online surveyor. This month's diatribe is not meant to warn us off using this new tool. Rather, it's designed to inform, so that we don't abuse the numbers that come out.

The next several paragraphs are a bit like a little math course. Don't worry too much about that; I'm not planning to make matters too complicated. I thought I'd review what a survey is, and how to interpret it.

Suppose we want to know the answer to the question "What fraction of Americans like Twinkies?" The only certain way to get the answer is to go ask every American if s/he likes Twinkies. Dividing the number of yes answers by the total number of responses gives the answer. Of course, no one can afford to do it this way. So instead we pick a random subset of people to ask. We'd like to think that if we ask enough people, the percentage we get will eventually come to mirror that of the population as a whole.

Let's suppose that the overall population is precisely evenly divided: half like Twinkies, half don't. If we went out and asked 10 people whether they liked Twinkies, would we be surprised if 60% said yes? Of course not - 6 out of 10 is actually pretty close to half. On the other hand, if we asked 1000 people and 60% still said yes, we would start to doubt our belief that half of Americans like Twinkies and half don't.

There is a way of making this all concrete. In taking a survey, we are drawing a sample from the overall population. Since the people are chosen at random, the results are not identical each time we run the survey. Skipping lots of math, we can say that the results will vary a bit because of the randomness in the procedure.
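If you'd rather watch that randomness than take my word for it, here is a tiny simulation sketch (in Python, purely my own illustration - it has nothing to do with the survey tool itself) that repeatedly polls a make-believe population that is exactly 50/50 on the Twinkie question:

    # Toy simulation: poll a population that is exactly 50/50 on Twinkies,
    # using samples of 10 and of 1000 people, and watch the answers wobble.
    import random

    def run_poll(sample_size, true_fraction=0.5):
        # Ask `sample_size` random people; return the fraction who say yes.
        yes = sum(1 for _ in range(sample_size) if random.random() < true_fraction)
        return yes / sample_size

    for n in (10, 1000):
        results = [run_poll(n) for _ in range(5)]
        print(n, ["%.0f%%" % (100 * r) for r in results])

Run it a few times: the samples of 10 swing all over the place, while the samples of 1000 hover within a point or two of 50%. That is the whole story of the next few paragraphs, in a few lines of code.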
There is a measure of this variability, called the "standard deviation", that tells us how much the results will vary. Roughly speaking, two-thirds of the time the result should be within one standard deviation of the correct result, and one-third of the time outside.

An example will perhaps clarify. If we take a random sample of N people from a population in which a fraction P like Twinkies (and therefore a fraction 1-P don't), the standard deviation of the number of yes answers is given by

    std dev = sqrt( N * P * (1-P) )

If the fraction of the population that likes Twinkies is half (.5) and the number of people we sample is 10, the standard deviation is sqrt((10)(.5)(.5)), or about 1.6. So we are not surprised when we find that 6 people said they liked Twinkies: the difference between 6 - what we got - and 5 - the "right" answer - is only 1, which is less than one standard deviation. On the other hand, when we sample 1000 people, the standard deviation is about 16, so finding 600 people saying they like Twinkies rather than 500 would be _very_ unlikely if the population really were evenly split. (There's a small worked version of this arithmetic further down, for those who like to check.)

One last point: sometimes the population is not uniform in its preferences. In that case we can actually do even better. For example, suppose it turned out that everyone west of the Mississippi likes Twinkies (there's a big factory in San Francisco), but everyone in the east hates them. Then we should actually do _two_ surveys and combine the results: find out what percentage like Twinkies in the east (near 0%) and in the west (near 100%), and then use the _census_ to weight the combined answer. You see, we don't have to determine the relative populations of the east and the west from our survey - the census numbers are far more accurate than anything we'll deduce from our small random sample. Since, under this scenario, the percentage of people who like Twinkies is the same as the fraction of people who live in the west, we can get a really accurate answer using the census.

OK, enough math. What's this all got to do with the web survey tool? Well, what goes wrong on CNN's site is that the people who answer the poll are _not_ randomly selected. The sample is skewed for at least two reasons: (a) the people who are on the web are not a random subset of the population as a whole, and (b) the people who decide they want to answer the poll are self-selecting, so people who feel strongly about the issues are more likely to vote than those who don't. Given that the web population is likely to be richer, whiter, and maler than the population as a whole, (a) is likely to lead to a more "Republican" view of Monicagate than a scientific poll gives. I would argue that (b) is likely to exacerbate this effect as well.

If we insist on trying to do a survey on the web, one thing we can do is ask people their age, race, sex, income, place of residence, and so on. Then we may be able to rebalance the population, to try to eliminate the systematic effects of (a). Of course, it may be tough to ask all those questions; it depends on the circumstances, I guess.

Alternatively, we can restrict ourselves to gathering information about the correlations between questions, rather than trying to find out absolute numbers. For example, we can say that of the people who like Twinkies, 80% also like HoHos. We might believe the web survey doesn't correctly reveal how many like Twinkies or how many like HoHos, but the ratio is more or less correct. There are still doubtless systematic effects of (a) and (b), but they are probably reduced when one divides one poll result by another.
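To make the earlier arithmetic concrete, here is the promised worked version (same caveat as before: my own illustration in Python, not part of the survey tool). It plugs the 6-of-10 and 600-of-1000 figures from the text into the standard deviation formula:

    # Standard deviation of the "yes" count when sampling N people from a
    # population in which a fraction P like Twinkies: sqrt(N * P * (1 - P)).
    from math import sqrt

    def count_std_dev(n, p):
        return sqrt(n * p * (1 - p))

    for n, observed in ((10, 6), (1000, 600)):
        expected = n * 0.5                 # the "right" answer for a 50/50 split
        sd = count_std_dev(n, 0.5)
        gap = abs(observed - expected) / sd
        print("N=%d: expected %.0f, saw %d, std dev %.1f, %.1f std devs off"
              % (n, expected, observed, sd, gap))

With 10 people, 6 yes answers is only about 0.6 standard deviations from 5 - nothing to write home about. With 1000 people, 600 yes answers is more than 6 standard deviations from 500, which is about as close to impossible as statistics gets.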
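And here is the census-weighting idea from a few paragraphs back, in miniature. The regional preferences and population shares below are invented for the sake of the example; the point is only that the subgroup answers get weighted by numbers we already trust (the census), not by whatever mix of people happened to wander into our sample. "Rebalancing" a web survey by age, race, sex, and so on is the same trick with different subgroups:

    # Weight subgroup survey results by known population shares.
    # The shares and percentages here are made up for illustration.
    def weighted_estimate(strata):
        # strata: list of (population_share, fraction_who_like_twinkies)
        return sum(share * fraction for share, fraction in strata)

    strata = [
        (0.42, 0.97),   # west of the Mississippi: invented share, nearly all fans
        (0.58, 0.02),   # east: the rest, nearly none
    ]
    print("weighted estimate: %.0f%% like Twinkies" % (100 * weighted_estimate(strata)))

The answer comes out at about 42% - essentially the fraction of people living in the west, exactly as the argument above said it should.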
Of course, we can just be psychotic, and use the poll the way CNN does: strictly for entertainment, and otherwise full of it.