THE PSYCHOTIC STATE
Number 8
April 30, 1999
Vox Populi?
There are, it is said, three categories of deception: lies, damned
lies, and statistics. My current state of psychosis was brought on by
the web's ability to generate so much of all three kinds. Humans have
been lying since they invented speech - before that, probably - but the
abuse of statistics is of course a much newer phenomenon. The web is
giving it a new - and baleful - boost.
Americans seem particularly in love with numbers - witness the nearly
hourly public opinion polls of the President's job approval ratings,
or ask the average 13-year-old baseball fan the ERA of the local
team's star pitcher. I have always found this an odd contrast to our
society's overall innumeracy: we have one of the worst math and
science education records in the industrialized world. (I once saw a
college student use a calculator to multiply a number by 10.)
The national mania to reduce everything to numbers, even meaningless
ones, has given rise to a fad: putting surveys on web sites. Even
some of the biggest sites do it, for example, the CNN poll. This is
prominently displayed on their home page, in the top, expensive screen
real estate. I am always fascinated by it, not because it tells me
much about what the world is thinking, but because of what it tells me
about CNN. For example, during the Monicagate shenanigans, the
results of the polls on the CNN web site were always substantially
different from those I heard on CNN and other news sources.
Generally, the web site survey was more "Republican" than the
scientific studies. What I deduce is that CNN was happy to leave, on
a page that ostensibly covers news, an item that told the site's
visitors what they wanted to hear - even if it was wrong.
Recently, Tim and Paul and the good folks of the Applied Concepts Lab
came up with Agency.com's contribution to this trend - our online
surveyor. This month's diatribe is not meant to warn us away from
using this new tool. Rather, it's designed to inform, so that we
don't abuse the numbers that come out.
The next several paragraphs are a bit like a little math course.
Don't worry too much about that; I'm not planning to make matters too
complicated. I thought I'd review what a survey is, and how to
interpret it.
Suppose we want to know the answer to the question "What fraction of
Americans like Twinkies?" The only certain way to get the answer is
to go ask every American if s/he likes Twinkies. Dividing the number
of yes answers by the total number of responses gives the answer.
Of course, no one can afford to do it this way. So instead we pick a
random subset of people to ask. We'd like to think that if we ask
enough people, the percentage we get will eventually come to mirror
that of the population as a whole.
Let's suppose that the overall population was precisely evenly
divided. Half like Twinkies, half don't. If we went out and asked 10
people whether they liked Twinkies, would we be surprised if 60% said
yes? Of course not - 6 out of 10 is actually pretty close to half.
On the other hand, if we asked 1000 people and 60% still said yes, we
would start to doubt our belief that half of Americans like Twinkies
and half don't.
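A quick way to get a feel for this variability is to simulate it. Here is a minimal Python sketch, assuming the 50/50 population split from the example above (the function name and seed are mine, purely for illustration):

```python
import random

# Simulated Twinkie surveys of a population that is exactly 50/50.
# Each run asks n randomly chosen people and reports the fraction
# answering yes.
random.seed(1)  # fixed seed so the sketch is reproducible

def survey(n):
    """Ask n randomly chosen people; return the fraction saying yes."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

print(survey(10))    # small samples swing widely around 0.5
print(survey(1000))  # large samples stay much closer to 0.5
```

Run it a few times with different seeds and the pattern holds: the 10-person surveys bounce all over, the 1000-person surveys cluster tightly around one half.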
There is a way of making this all concrete. In taking a survey, we
are doing a sampling of the overall population. Since the people are
chosen at random, the results are not identical each time we run the
survey. Skipping lots of math, we can say that the results will vary
a bit because of the randomness in the procedure. There is a measure
of this variability, called the "standard deviation", that tells us
how much the results will vary. Roughly speaking, the result should
fall within one standard deviation of the correct result about
two-thirds of the time, and outside it about one-third of the time.
An example will perhaps clarify. If we take a random sample of N
people from a population in which a fraction P like Twinkies (and
therefore 1-P don't like them), the standard deviation of the number
of yes answers is given by
std dev = sqrt(N P (1-P))
If the fraction of the population that likes Twinkies is half (.5) and
the number of people we sample is 10, sqrt((10)(.5)(.5)) = 1.6. So we
are not surprised when we find that 6 people said they liked Twinkies.
The difference between 6 - what we got - and 5 - the "right" answer -
is only 1, which is less than one standard deviation away. On the
other hand, when we sample 1000 people, the standard deviation is
about 16, so finding 600 people saying they like Twinkies rather than
500 would be _very_ unlikely, if everything is random.
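For the record, the arithmetic above is easy to check in a few lines of Python (the helper name is mine):

```python
import math

def count_std_dev(n, p):
    """Standard deviation of the number of yes answers in a sample of
    n people, when a fraction p of the population would say yes."""
    return math.sqrt(n * p * (1 - p))

print(count_std_dev(10, 0.5))    # about 1.6
print(count_std_dev(1000, 0.5))  # about 16

# How surprising is 600 yeses out of 1000 if the truth is 500?
print((600 - 500) / count_std_dev(1000, 0.5))  # over 6 standard deviations
```

A result more than six standard deviations from the expected value essentially never happens by chance, which is why the 1000-person result would make us abandon the 50/50 belief.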
One last point: sometimes the population is not uniform in its
preferences. In that case we can actually do even better. For
example, suppose that it turned out that everyone west of the
Mississippi liked Twinkies (there's a big factory in San Francisco),
but everyone in the east hated them. Then we should actually do _two_
surveys and combine the results: find out what percentage like
Twinkies in the east (near 0%) and in the west (near 100%), and then
use the _census_ to weight the combined answer. You see, we don't
have to determine the relative populations of the east and the west -
the census numbers are far more accurate than what we'll deduce by our
small random sample. Since, under this scenario, the percentage of
people who like Twinkies is the same as the fraction of people who
live in the west, we can get a really accurate answer using the
census.
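The combining step is just a weighted average. As a sketch (every number below is invented for illustration; only the arithmetic matters):

```python
# Census-weighting sketch. The regional yes-rates and census shares
# are all made up; the point is how the two surveys get combined.
west_yes_rate = 0.98      # regional survey result west of the Mississippi
east_yes_rate = 0.02      # regional survey result in the east
west_census_share = 0.40  # assumed fraction of Americans in the west
east_census_share = 0.60

national_estimate = (west_yes_rate * west_census_share
                     + east_yes_rate * east_census_share)
print(round(national_estimate, 3))  # 0.404
```

The census shares carry essentially no sampling error, so all of the uncertainty in the combined answer comes from the two (easy) regional surveys.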
OK, enough math. What's this all got to do with the web survey tool?
Well, what goes wrong on CNN's site is that the people who answer the
poll are _not_ randomly selected. The sample is skewed for at least
two reasons: (a) the people who are on the web are not a random subset
of the population as a whole, and (b) the people who decide they want
to answer the poll are self-selecting, so people who feel strongly
about the issues are more likely to vote than those who don't. Given
that the web population is likely to be richer, whiter, and maler than
the population as a whole, (a) is likely to lead to a more
"Republican" view of Monicagate than a scientific poll. I would argue
that (b) is likely to exacerbate this effect as well.
If we insist on trying to do a survey on the web, one thing we can do
is ask people their age, race, sex, income, home residence, and so on.
Then we may be able to rebalance the population, to try to eliminate
the systematic effects of (a). Of course, it may be tough to
ask all those questions; it depends on the circumstances I guess.
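That rebalancing idea can be sketched in a few lines. This is the same census-weighting trick applied to respondents; the demographic groups, their census shares, and the toy answers below are all hypothetical:

```python
from collections import Counter

# Rebalancing sketch: reweight web respondents so that each demographic
# group counts in proportion to its share of the real population.
respondents = [
    # (group, likes_twinkies) - invented toy data
    ("under_30", True), ("under_30", True), ("under_30", False),
    ("over_30", True), ("over_30", False), ("over_30", False),
    ("over_30", False),
]
census_share = {"under_30": 0.5, "over_30": 0.5}  # assumed, e.g. from census

total = len(respondents)
group_sizes = Counter(group for group, _ in respondents)

# Each respondent's weight = (group's census share) / (group's sample
# share), so over-represented groups get dialed down and vice versa.
weighted_yes = sum(
    census_share[group] / (group_sizes[group] / total) * likes
    for group, likes in respondents
)
print(weighted_yes / total)  # the rebalanced estimate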
Alternatively, we can restrict ourselves to gathering only
information about the correlations between questions, rather than
trying to find out absolute numbers. For example, we can say that of
the people who like
Twinkies, 80% also like HoHos. We might believe the web survey
doesn't really reveal correctly how many like Twinkies or how many
like HoHos, but the ratio is more or less correct. There are still
doubtless systematic effects of (a) and (b), but they are probably
reduced when one divides one poll result by another.
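The arithmetic here is nothing more than a ratio of two counts; a toy sketch with invented poll numbers:

```python
# Ratio sketch with made-up web-poll counts: the raw counts may both be
# biased, but the conditional rate "of Twinkie fans, what fraction also
# like HoHos?" may be less distorted, since some bias divides out.
twinkie_fans = 250           # hypothetical raw yes-to-Twinkies count
twinkie_and_hoho_fans = 200  # hypothetical count of yes to both

print(twinkie_and_hoho_fans / twinkie_fans)  # 0.8, i.e. 80%
```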
Of course, we can just be psychotic, and use the poll the way CNN
does: strictly for entertainment, and otherwise full of it.