| Some fluff, some info |
|
28 August 04. I (heart) factor analysis! OK, I've made the comment often enough that it's not fun anymore, but stats textbooks often blur the distinction between descriptive procedures and hypothesis testing procedures, and this creates endless woes in the world---notably, researchers are often taught to do exploratory stuff on data and then test the hypotheses that they'd just cooked up using the same data, which adds to human knowledge in some ways, but the hypothesis test is completely and totally invalid, with 100% certainty. Won't harp now; have blogged about this already. There's something of a focus on hypothesis testing with most of this stuff since, again, it's the hypothesis test that gets you published. So this little column is to remind the reader of one of those things that gets left by the wayside: factor analysis (aka singular value decomposition or principal component analysis). It's more advanced than the basic descriptive stats (mean, variance), but not a hypothesis test, so in my own limited experience it seems to get left by the wayside. Before I lose every last one of you, here are two super-hip examples of why factor analysis is fun. The first is Poole & Rosenthal's analysis of the U.S. Congress. They found that about 90% of the variance in voting patterns can be described in two dimensions, so that can be made into a graph; string them together, and you have a movie of the U.S.'s political history. Second example: Eharmony.com has a system to match you with someone you're most compatible with, and how do they do it? Factor analysis, of course. They give potential love birds a survey of a few hundred questions (oh, the tedium!) and then map the test-taker in a few artificial dimensions. They find people who are close in the artificial space, and have them go out on dates. Romance blossoms. [It would be rude of me to encourage the reader to use factor analysis without mentioning that Eharmony has patented their matchmaking algorithm. 
Therefore, when you do your factor analysis, you must be careful that you don't assign variable names that sound too much like `sexual passion', `spirituality', or 27 other such terms---if you do, then you are in violation of the patent and could be sued for damages. As of this writing, most other names one could assign to the dimensions resulting from a factor analysis are still in the public domain, and can be used without a licensing fee.] Oh, and my other favorite is a paper by Moore et al. (citation available on request) about color perception: for sighted people, factor analysis neatly puts the perception of words such as `red', `blue', and `yellow' in two dimensions, in a circle---a color wheel. Factor analysis of responses to the same words by the blind puts them on one dimension, basically ranging from bright to dark. And thus, factor analysis shows us what the blind see. There are two ways you could go about it, I guess: the first is to say, `I have no idea how the data was generated, but darn it, I want a picture', in which case a two-dimensional factor analysis fits the bill wonderfully, and once you get the picture, you just might learn something (even if it can't be formally tested). The other is to say `I really think there are some latent variables driving the variables I've observed', and then factor analysis may again save the day, by showing you the best linear combination of existing variables to suggest what those latent variables may be. Both of these sorts of behavior are exactly what statistics is really about, and I think they're a great thing to try on any given data set. This all comes up because I have some data on how junior high kids smoke. I have zero information about the kids, not even their gender, since the guy who collected the data wants to play with it more and get some more publications out before he puts it out for others to use. 
So I'm not assuming that there are important latent variables underlying what data we have---I know that there are important missing variables underlying the data we have. The factor analysis did a very neat job of pulling out what I reasonably believe are the most important underlying characteristics, thus saving the day. [The hypothesis tests, by the way, are about ten steps further down the line in the project, so no, no worries about confounding the two. Won't bore you with the details.] So don't forget to use factor analysis on your favorite data set, and maybe write a paper or two about the results, since good results from a good factor analysis really can teach the world stuff. Also, next time you're asked to review a paper, please remember that good descriptive stats can stand by themselves without being bolstered by a table of statistically sound but procedurally bogus linear regressions. After all, the true level of explanatory power and persuasiveness of a paper just isn't measured by confidence intervals.
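For readers who want to see the `latent variable' story in miniature, here is a rough sketch in Python. It is not the procedure used on the smoking data: it is plain principal component analysis via power iteration, run on made-up survey answers. The function names, the four-question survey, the noise level, and the seed are all assumptions invented for illustration.

```python
import math
import random


def first_principal_component(rows, iterations=200):
    """Return (direction, share_of_variance) for the leading component."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Sample covariance matrix of the observed variables.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    # Power iteration: repeatedly apply cov and renormalize; the vector
    # converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(k))
                     for a in range(k))
    total_variance = sum(cov[a][a] for a in range(k))
    return v, eigenvalue / total_variance


def make_survey(n=500, seed=1):
    """Hypothetical data: four answers, all driven by one hidden trait."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        latent = rng.gauss(0, 1)  # the variable we never get to observe
        rows.append([latent + rng.gauss(0, 0.3) for _ in range(4)])
    return rows


direction, share = first_principal_component(make_survey())
print("weights on the four answers:", [round(x, 2) for x in direction])
print("share of variance on one dimension:", round(share, 2))
```

The four weights come out roughly equal, which is the analysis reporting that one hidden dimension is driving all four answers. A second dimension would need the same step repeated on what is left after removing the first.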
|
12 February 05. Apophenia
Just a little note that I've finally put up my library of stats functions for public consumption and modification.
But I still work in C, and I still do statistics, and there are other people out there just like me, I just know it. My desires are supremely simple: I just want a good, reliable toolbox. I'm not a visual person, and rarely do exploratory stuff in the way of just meandering through a data set. Usually, I have a specific question and want a nice, precise answer. That is, I want to apply a specific tool to the data. The cute user interface of the typical stats package is also wasted on me---and it should be wasted on you too. If a reader asks nicely, you should be able to send him/her/it instructions for replicating your procedure from raw data to little stars after the t-statistic in your paper, and that means eschewing the clicky buttons for a written script.
|
06 October 05. Why macroeconomics sucks
Or more specifically, why time series analyses are not to be trusted. I've often mentioned here that time series analyses need to be eyed with great suspicion. Here, I give a detailed explanation why. How it's supposed to work:
The worst macro models and time series screw up every single step in the above chain---sometimes many times over.
Write down a model
There are more data points than any of us could possibly count. However, there are only so many causal stories that we humans have been able to write down. Given a time series to explain, like GDP, and a hundred completely random variables, like sales of Barbies, beats-per-minute of Billboard's #1 song, Pantone numbers for the colors on Ikea's catalog, you are guaranteed with near-certainty that at least one of those variables has a strongly significant correlation with the variable to be explained. Here, apophenia kicks in, and after you've seen that BPM and GDP are correlated, you'll have no problem inventing a model for it. In short, the model has to come first, and has to be a serious attempt at explaining the world. But you knew that. There's also the problem that it is very difficult to write down a model for which there isn't another model with the causation the other way `round. Timing won't necessarily save us: Christmas card sales cause Christmas, as the saying goes. But that's another deeper problem.
No, really, write down a model
Now, one problem with real data is that everything moves at once. Thus, as soon as you say 'A causes B', somebody will brusquely interject that no, C causes B, and A is just a bystander in it all. Therefore, macroeconomic papers often control for 'the usual variables'. Reapplying the principle from the last section, don't write down a regression without a model to back it up---and having a model to back up half of it is as bad as having no model at all. You no doubt felt that the section about how having a model is important is self-evident, and most serious macro papers will start off with a good model. But the statistics with which the model is estimated will almost always include spare variables which aren't in the model. With micro papers, it's a mixed bag; with macro papers, this is the norm. 
Without a model to say anything about the extra variables, you've got a lot of leeway to screw around. If C, D, and E don't reshape B the way you want, try C squared, log of D, and D times E ('D interacted with E', as the lingo goes. Used in this manner, this term is meaningless. Does D times E appear in your model?). We throw out the parameters estimated for these extra control variables, and some take that to mean that we don't have to bother with a model for them. That alibi is the downfall of serious time series analysis, and covers a great deal of the empirical macro literature.
Using facts about the world
I used to think that stats is just an arbitrary list of customs that we made up so we have common means of arbitrating our disputes. But no, there are solid foundations to it. Flip a coin a hundred times, write down the number of heads. Repeat the hundred-flip procedure a few hundred times. You now have a list of numbers between zero and a hundred, and you can plot their frequency. Most of the numbers you wrote down will be near fifty, and things will taper off as you get to the ends---a bell curve. By which I do not just mean a curve which is fat in the middle and thin at the ends. I mean: p(x) = exp(-(x-50)^2/50)/sqrt(50 pi). This is the sort of precision we need to be able to say that one thing differs from another with 95.28% certainty. In fact, it's the sort of thing we need to say anything at all, because we only have a few coin flips, but we're trying to say something about what would happen with an indefinite number of future coin flips. Having a mathematical theorem about the probabilities in the limit means we don't have to take guesses. Why is there the square root of pi in the denominator? One way to explain it is to point out that it integrates to one using a trick in polar coordinates. We're harking back to the relationship between exponentiation and trig on the complex plane that you slept through in high school. 
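If you would rather not flip a coin fifty thousand times by hand, the experiment is easy to sketch in Python. The normal approximation to a hundred fair flips has mean 50 and variance 25, which gives the density exp(-(x-50)^2/50)/sqrt(50 pi). The function names, the 500 repetitions, and the seed below are assumptions picked for illustration, nothing more.

```python
import math
import random


def normal_approx(x, n=100, p=0.5):
    """Bell-curve density with the same mean and variance as n coin flips."""
    mean = n * p
    var = n * p * (1 - p)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def binomial_pmf(x, n=100, p=0.5):
    """Exact probability of x heads in n flips."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def simulate(rounds=500, flips=100, seed=7):
    """Count heads in `flips` fair flips, repeated `rounds` times."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(flips)) for _ in range(rounds)]


counts = simulate()
print("heads  observed  exact   bell curve")
for x in (40, 45, 50, 55, 60):
    observed = counts.count(x) / len(counts)
    print(x, round(observed, 3), round(binomial_pmf(x), 4),
          round(normal_approx(x), 4))
```

The exact and bell-curve columns agree closely, while the observed column wobbles around them, which is the point: a few hundred repetitions are noisy, and the theorem is what lets us say something precise anyway.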
For many other processes, there are other very specific things we can say about the resulting probability distribution. Statistics relies desperately on these few tricks we have at our disposal, generally known as the Central Limit Theorems (CLT). What do you do if the process you've modeled doesn't fit the assumptions of a CLT? Then you can't say how confident you are that the value for A that you drew differs from zero, because you don't know the distribution of A down to a square root of pi. If you draw another set of values for A, maybe it'll be distributed like the few that you've already drawn, but maybe it'll be totally different. For example, for many types of game, if you have two players repeating the game a thousand times, the distribution of actions that player one took will have nothing at all to do with the distribution of actions that player two took, because the distributions are not independent---what player two does is directly related to what player one does. Experimental game theorists know this, and set the unit of one observation to be a whole run of the game. If you want enough data, don't have a hundred people play the game together and call it a hundred data points, but run the entire experiment with all new people a hundred times. Back to time series. The model claims that the variable of interest is related to the various other things we wrote down, plus or minus an error term each period. As with the game playing above, the CLT will only apply when those error terms are independent and identically distributed. Independent: the error last period had nothing to do with the error this period; identically distributed: the distribution we're drawing the errors from doesn't change with time. For a time series, these assumptions are untenable. It is very difficult to invent a story for what those error terms mean that reasonably fits the independence and identical distribution assumption. 
Yes, I know you allowed for different variances via the var-covar matrix, but why am I supposed to believe you that the mean is constant or follows a linear trend that you can soak up with the coefficient on date? What that means is that we can't apply the central limit theorems unless we make hefty assumptions about the world outside the model: everything in the world can be reduced to one variable, whose mean is constant, or at best, whose mean moves at a constant, linear step every period. That error is probably the mean of a hundred variables, many of which are moving up with time and many of which are moving down with time---no CLT on earth is going to tell you anything about what that series of errors will look like (even though there are CLTs to say something about the mean of unordered and independent draws from a hundred outside variables). OK, one last attempt at explaining it: for any variable you gather over a hundred periods, I can find you a hundred other unrelated variables, somewhere, that sum up in some reasonable linear combination to exactly the data series you gathered. In many lab or micro settings, this isn't the case, but it holds for the vars of all macro time series studies. So if you include the hundred variables which replicate your pet variable, your pet variable will lose significance (especially since collinearity means your data matrix is now singular and (X'X)^-1 blows up). So you exclude the hundred variables—but now your error term is based on a process which replicates your variable, and IID goes up in flames again. Darned if ya do, darned if ya don't, and there's no mathematical formula to tell you how to select variables for inclusion or exclusion, so it's a crap shoot, which means the parameter estimates you get are a crap shoot. The sane thing to do would be to use the model, but the model doesn't mention but a few variables, and the other typical time-seriesesque variables just get included by fiat or custom. 
Sure, it happens often enough that the error term really is well behaved, and everything that isn't in the model has a neutral effect. It is often true here in the real world that A does cause B. But most statistical analyses in macroeconomic and time series studies do little or nothing to help ferret that out, because they need to make a dozen arbitrary outside-the-model assumptions for the statistical test of the model to work out. Without backing up the assumptions in mathematical results, they are as up for question as the model itself. Yeah, sometimes it's OK to fudge the assumptions (which are never 100% true to begin with), but it's nice to at least pretend to take them seriously.
Policy implications
Be Bayesian. When the paper says 'Therefore, the coefficient on A is significant with 99.99986% certainty', read that to mean 'there's one more piece of evidence that there might be something special about A.' With a hundred of `em, we can maybe start believing that A really is special. But any one time series, no matter the R^2 or the number of stars by the coefficients, can only provide a limited amount of evidence, because unless it is done right, it flouts too many of the fundamental assumptions underlying statistics.
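To put a number on the earlier claim about a hundred completely random variables, here is a hedged Python sketch. The first figure is plain arithmetic: if each of a hundred independent tests has a 5% false-positive rate, how likely is at least one hit? The second is a toy simulation in which both the series to be explained and the hundred candidates are independent random walks. The random-walk setup, the fifty periods, the rough 1.96/sqrt(n) cutoff, and the seed are all assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def random_walk(rng, periods):
    """A series whose level drifts by an independent shock each period."""
    level, path = 0.0, []
    for _ in range(periods):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def count_spurious_hits(candidates=100, periods=50, seed=3):
    """How many unrelated walks look 'significantly' correlated with the target?"""
    rng = random.Random(seed)
    target = random_walk(rng, periods)
    # Rough large-sample 5% cutoff for |r|; it assumes independent draws,
    # which is exactly the assumption a time series violates.
    threshold = 1.96 / math.sqrt(periods)
    hits = 0
    for _ in range(candidates):
        if abs(pearson(target, random_walk(rng, periods))) > threshold:
            hits += 1
    return hits


chance_of_any = 1 - 0.95 ** 100
print("chance of at least one false positive in 100 tests:",
      round(chance_of_any, 4))
print("unrelated random walks passing the cutoff:", count_spurious_hits())
```

The arithmetic alone gets you to near-certainty, and the simulation typically does worse than 5% per candidate, because trending series correlate with each other far more readily than the independent-draws cutoff expects.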
|
11 April 06. Anti-intellectual
Pundit is a term from Hindi meaning “wise and learned man”, but it is usually used sarcastically in modern parlance. But, y'know, I don't feel so sarcastic about it. You can decide the “wise” part for yourself, but having spent a couple of years studying the narrow topic of subject matter expansion in patent law, I am confident describing myself as an authority. It's been months since I've heard a new argument on either side of the debate, and the new facts I'm learning are increasingly fine details. I don't feel any hubris when I say that nobody is going to blindside me on the tiny, narrow bit of subject that I have chosen for myself. And ya know, most of the arguments that I have presented in various media and to various bigwigs over the last few months are arguments that numerous non-experts have also made.
I often run into people who divide academic results into two categories:
(1) things anybody could have come up with after a bit of thought, and
(2) things that are too esoteric to be worth anything. Some exceptions
are made for chemists and engineers, whose work the commonsense folk have
some sense is esoteric but will somehow eventually lead to new toys or a
cure for something, but everybody else--the mathematicians who study
tensors in R^14
OK, what are we to make of this? What message is being sent? Mashing
together the studies means that the findings do not add up to any real
image of the world, even if the page does categorize the findings for
some sense of flow. Readers can't drop these tidbits into cocktail party
conversation, because they only have one piece of information and so
aren't armed for even the simplest follow-up. Interested readers can't
learn more, because there are no citations. More importantly, there is
no context: we are not given the reason for studying guppy reproductive
systems, so we don't know why a scientist would care to do such a thing.
Being the back page, we know that it's supposed to be humorous,
and with everything taken out of context, it can be, the way that so
many statements out of context or in a different context are funny.
But there's
also the sense of laughing at the scientists. The subject of every
sentence (but the passive-voice ones) is a researcher or a study or a
survey. If the editors just wanted to list facts, they'd say “Americans
are becoming less repulsed...” but instead they waste ink pointing
out that “A study found that Americans are becoming less repulsed...”.
If there were an American Association Against Science, they would
probably reprint the Findings page verbatim. The AAAS would ask, in
big red letters, "Why are we spending money on this?" and the answer to
why would not be anywhere to be found.
But you know that I spend all day studying obscure features of
people's behavior and reading math books, so it's no surprise that I'm
anti-anti-intellectual. It's no secret that if I had an anti-intellectual
in the room here, I'd tell him or her (reading from Harper's again) “New
data suggested that Uranus is more chaotic than was previously
thought.”
[See, statements in a different context are downright hilarious!]
But it goes further than my kind of academic. The anti-intellectual
sentiment--the insistence that it's either common sense or it's
not worth the trouble--is a belief that there is no such thing as
an expert. It is the myopic belief that if I don't know it, then there's
nothing to know. As such, the anti-intellectual sentiment is often aimed
at targets well far afield from intellectuals.
At the Baltimore Museum of Art, the same establishment that houses
Picasso's Mother and Child, are such aggressively simple works of art
as two silkscreen
reprints of the Last Supper, and a curtain of blue and silver beads. Some
readers will recognize the first as a work by Andy Warhol, and thus know
the context: Mr. Warhol felt that the repetition and mutation of familiar
images created new perspectives. For the second, as for a great deal of
art that was clearly easy to execute, we don't know the context at
all.[1] But even though we
don't know it, there is a context. The guy went to art school, has had
a few focal ideas that drove all his work, and has done years of pieces
that led to this simple bead curtain.
So what is an expert to do? One approach is to always stick to things
that are obscure and look hard. Make sure that every study, every work
of art, every essay says fuc* you, I'm an expert
and you can't do what I do. But we value people who make it
look effortless, whether they're figure skating, producing a painting,
or running regressions. We always value simplicity, so if all it takes
to get across the message is a curtain of beads, then why overcomplicate
things to remind the viewer that it took years of work to get there? Some
of the best guitarists out there never really ventured past four
chords, while the guys who can play intricate solos are often dubbed
wankers.
I'm glad I wrote my PhD thesis, and more generally love the idea of a
thesis, including for high school seniors, BAs, or anywhere in
between. A good thesis means that the author has become an expert in some
tiny, irrelevant little corner of the world. Research ability by itself
is valuable, and it's good practice for when the student needs to be an
authority in something of more practical value, but it also gives the
student an idea of what the other experts of the world have gone through
to get to their simple ends. Remember that part in Zoo Story
where the guy says that “sometimes it's necessary to go a long distance out of the
way in order to come back a short distance correctly”?
A student who has gone a long way in becoming an expert, and
must then reduce that to the sort of ten second summaries that we all
give to friends and family, will have a better understanding of the long
distance that other experts have gone before they could string together
simple words or beads or chords.
|
14 May 06. The Web as human network
I'd like to discuss the question of how technology has changed personal relations. That'll come next time. For now, let's look at a specific, vaguely related question: does the link structure of the Net mirror the link structure of human networks? Back when Alta Vista was the highest view in Internet search, a few IBM and Alta Vista researchers did a rather detailed study of the Web's structure (1). They, as with many others, found that the distribution of links on the Net looked a lot like the distribution of human links. There is a power law distribution where there are a few sites that are linked endlessly, and a long tail of sites that only have a few links.
To give an example of a power law, here is a graph based on data from junior high classes. The most popular student is on the X-axis at the far left (at X=0), and was nominated as a best friend by a mean of 9.75 other students (over 88 classrooms in the sample). Over on the other end of the X-axis, the 25th through 35th ranked students in the classroom were each nominated as a best friend by a mean of fewer than one other student. So you've got a few very well-connected students and a lot of students who have no connections at all. We see this pattern in social networks of all scales, and among Web pages. The nomination count graph is typically a little more curvy than this one, with even more of a steep slope down from the most popular members of the group and a longer tail at the other end. It sounds like the WWW as interpersonal network metaphor is working OK, but two caveats: first, there is much debate as to whether the best fit for the link distribution of the Web is a Negative Exponential, a Gamma, a Zipf, or a variety of other distributions that all look identical to a non-expert. Unless you hope to study this stuff seriously, you don't have to care about this caveat and can just call it a power law. The best fit to the student data is a Gamma distribution, by the way. Second, human networks are pretty symmetric, in that there are few face-to-face contacts where one party is ignorant of the other. The exception is celebrities, whom we know but who don't know us, but we can throw those out and have a reasonably symmetric set of acquaintance links. The popular kids may not want to hang out with the unpopular ones, but they know them nonetheless. But with Web pages, it happens all the time that a page makes no indication of what other pages are linking to it.
Broder et al. found that this asymmetry occurs on a grand scale. They divide the Web into a giant Strongly Connected Component (SCC) comprising about a quarter of the Web; these are sites that interlink with each other. Then there's a quarter that only links in to the SCC but does not receive links. That would be blogs from losers like me. Then there's a quarter that is linked from the SCC but does not link to anything in particular, comprising corporate sites that just go in internal circles and things like online books and manual pages that are informative but not filled with links. The final quarter, they called tendrils, indicating a trail of limited links that doesn't readily fall into the first three categories. Thus, because a web page is not a person, the symmetry of human networks does not map to web links. Another important distinction is that the whole small world game, where we try to find a chain of people from a guy in Katmandu to a guy in Omaha, does not work for the Net, because if you start on the right side of the bowtie, you cannot get to the left side. For humans, you can almost certainly find a chain, and it'll be well under ten people in almost all cases; for the Net, you only have about a 25% chance of being able to form a chain from any randomly selected site to any other randomly selected site. E.g., try getting from This haphazard site in Canada to this site here (hint: you can't). When you can form a chain, say from the in-feeding region to the SCC region, then it can still be hundreds of nodes long if one element is well-buried in a subculture. Now, with human networks, we can distinguish between acquaintance, which is almost by definition symmetric, and friends, which is depressingly unidirectional, typically from low-status to high-status. I don't believe this metaphor is particularly well-studied, but it doesn't work very well. 
The net receivers of links for the Net are not high-status pages, but pages that just provide information (corporate, technical, whatever). But getting back to the part of the metaphor that does work, there are two characteristics to both networks. First, there's a cost to linking both socially and online, because you need to find the subject of your interest and know them. Second, there is a cost to searching for new links. An immediate corollary to expensive search is a principle that the rich get richer: the easiest way to find new links for your own personal address book is to ask others for their contacts, so well-linked people/sites are more likely to get more links. More on this next time.
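The bowtie is easier to believe once you walk one. Below is a toy directed graph in Python with an in-feeding region, a strongly connected core, and a linked-to region, plus a breadth-first search that follows links forward only. The seven page names and their links are entirely hypothetical, and the graph is far too small to reproduce the figures from the study; it only shows why chains run one way.

```python
from collections import deque

# Hypothetical pages: page -> pages it links to.
LINKS = {
    "blog_a": ["hub_1"],            # in-feeding: links in, nobody links back
    "blog_b": ["hub_2"],
    "hub_1": ["hub_2", "manual"],   # core: every hub can reach every other hub
    "hub_2": ["hub_3"],
    "hub_3": ["hub_1", "corp"],
    "manual": [],                   # linked-to: receives links, links nowhere
    "corp": [],
}


def reachable(start, links):
    """All pages you can get to from `start` by following links forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def connected_pair_share(links):
    """Fraction of ordered (from, to) pairs joined by some chain of links."""
    pages = list(links)
    pairs = len(pages) * (len(pages) - 1)
    linked = sum(len(reachable(p, links)) - 1 for p in pages)
    return linked / pairs


print("blog_a can reach manual:", "manual" in reachable("blog_a", LINKS))
print("manual can reach blog_a:", "blog_a" in reachable("manual", LINKS))
print("share of ordered pairs with a chain:",
      round(connected_pair_share(LINKS), 2))
```

Swap in a symmetric graph, where every link is mirrored, and the share jumps to one whenever the graph is in a single piece, which is the human acquaintance case.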
(1) Broder et al., "Graph Structure in the Web".
|
26 May 06. Invariants
This is about two technological revolutions that didn't happen, and aren't going to happen any time soon. To some extent, this is also about a recent revolution in economics, where the study of how people interact has shown that there ain't nearly as much variation as we'd thought before: what we thought was wide variety is actually just a combination of invariants. More generally, it's a result of computational progress that has allowed us to pay more attention to distributions that are not in the Gaussian family (binomial, Normal, t, F, chi-squared) like the exponential, poisson, Zipf, &c. The problem is that we humans have limits, and they have not in any way changed thanks to technology. The key limits are time and memory. Who here bought R.E.M.'s Out of Time on vinyl or cassette? The first result of these limits is the size of our comprehensible network. That is, how many people do I know well enough that I could hold a friendly conversation with them? We can connect faster via cellular telephones, email, ntalk, or whatever point-and-talk technology has emerged since I wrote this, and so the time spent connecting is shorter, and we can cheaply connect to more distant people. But once the connection is made, we still have to resort to just talking or writing as before. This takes time, and the new toys don't speed this up at all. Sure, you've got Friendster (or whatever the cool kids are using these days) allowing you to browse through photos of your pals, but back in the day, you had a paper address book, with scraps of everything hanging out of it, that let you do the same thing. de Sola Pool and Kochen [C] made various attempts at estimating the number of acquaintances that a person has, and found that folks generally have about 1,500 immediate acquaintances whom they will see over the next two months once or twice and say hi to, and then about 4,500 less direct acquaintances, like the people from college whom they'll only see every few years. 
Perhaps our online networks have sort of blurred the lines on the close-by acquaintances and distant acquaintances, but how many hundreds of your high school pals have emailed you lately? But that's all scale: what about structure? Are our social hierarchies flatter and more egalitarian now that we've got the Net? Again, no. We still see the same sort of pattern we saw in last episode: a few people who are very well connected and a lot of people who are minimally connected. The debate (about which I am no authority) is whether this is because some people have a higher capacity to maintain pals, due to more time dedicated to it or an innate name-and-face memory; or because of a rich-get-richer story that people find new pals via their old pals, so those who are well-networked will only wind up better-networked in the future. The true story is no doubt a bit of both. Costly maintenance of links and costly search for new links have not changed for us humans. Generally, if you've got both of those characteristics, you're going to have a network that looks like standard social networks, and if those limits are set by the human brain and our 24 hour day, then the scale of those networks is set.
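The rich-get-richer half of that debate is simple enough to simulate. The Python sketch below uses a standard toy model usually called preferential attachment: each newcomer makes one pal, chosen with probability proportional to how many pals that person already has. It is not fitted to any data mentioned here, and the function name, the 2,000 people, and the seed are assumptions for illustration.

```python
import random


def grow_network(people=2000, seed=11):
    """Return each person's number of pals after everyone has joined."""
    rng = random.Random(seed)
    degree = [1, 1]       # start with two people who know each other
    endpoints = [0, 1]    # each person is listed once per link they hold
    for newcomer in range(2, people):
        # Picking uniformly from `endpoints` favors the well-connected,
        # since they appear in the list more often.
        friend = rng.choice(endpoints)
        degree.append(1)
        degree[friend] += 1
        endpoints.extend([newcomer, friend])
    return degree


degree = grow_network()
ranked = sorted(degree, reverse=True)
print("five best-connected people:", ranked[:5])
print("median person:", ranked[len(ranked) // 2])
```

A handful of early arrivals end up with dozens of links while the median person has one. Nobody in this model has a better memory for names and faces than anybody else; arriving early and being findable is enough.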
Content
Moving on from social networks, the second limit is in what we can produce. If you spent every minute of the next year typing away at your keyboard, your computer's hard drive would barely notice it all. [1 word = about 6 bytes. 60 words per minute times 1440 minutes per day times 6 bytes per word = 518,400 bytes/day; in a year that's about 180MB.] For most of us, everything we ever wrote would easily fit onto a single CD. That is, the technology of text processing has blown past the human ability to produce text. For music and still pictures, we're in about the same place. The roadblock is not in storage and transmission, but in the process of finding artistic inspiration and the time and skill needed to execute it. Moving pictures are not far behind, and twenty years from now, downloading a movie won't take a moment's thought by anybody. Nobody will worry about the price of film stock, but the process of writing and producing a movie will still be a massive effort. On the consumption side, it still takes 70 minutes to listen to Beethoven's Ninth, though you no longer have to get up and flip the disc in the middle. It still takes 90 minutes to watch a ninety-minute movie. The `read any day now' pile of articles on my hard drive has certainly grown, but the `articles I've read' pile grows at the slow, steady pace it always has, and the `articles I remember reading' pile continues to wither. So scale is again set. As for structure, we find that there is again the same power-law type distribution in consumption. If we plot sales and Amazon sales rank on a log-log scale, we find that it's linear. In other words, the top ten best-selling books sell ten times as much as the bottom of the top 100, and those sell ten times as many as the bottom of the top 1,000, and so on down into the millions. [Below the top sellers, by the way, the ranking is basically the order of last sale.] 
That is, content is another power law, and that structure doesn't change with onlineness: before millions of blogs only read by three people, there were `zines only read by three people, and before that, letters. So the distribution of book popularity happens to match the distribution of people popularity, which is no surprise, because the same two problems--costly search and costly linking/consumption--are an issue in both cases.
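The `ten times as much for every tenfold drop in rank' statement is the same thing as a straight line of slope minus one on a log-log plot, which a few lines of Python can confirm. The sales figure for the top-ranked book is a made-up number, and the exponent of one is simply what the ten-for-ten rule implies, not something estimated from Amazon data.

```python
import math


def sales_at_rank(rank, top_sales=10_000.0, exponent=1.0):
    """Hypothetical sales for a book at a given rank: top_sales / rank^exponent."""
    return top_sales / rank ** exponent


def loglog_slope(ranks):
    """Least-squares slope of log10(sales) against log10(rank)."""
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(sales_at_rank(r)) for r in ranks]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


ranks = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
print("slope on the log-log plot:", round(loglog_slope(ranks), 6))
print("rank 10 sales / rank 100 sales:",
      round(sales_at_rank(10) / sales_at_rank(100), 6))
```

Change `exponent` and the line stays straight but tilts, which is why a log-log plot is the usual first check for this family of distributions.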
Policy implications
We are all more-or-less as networked as we're going to be by maybe age sixteen [socially; sexual networks follow different patterns from social networks, and tend to take more of a rich-get-richer form.[S]]. When you meet somebody new, they're crowding out somebody else, as time spent cultivating your new pal is not time spent cultivating the old. The same works for entire networks: just as advertisers must compete for your few dollars, networks must compete for your limited networking resources. Similarly, having a wealth of new content available just means that we have a wealth of things that we'll never read because they're crowded out by the other things we're reading.
I don't mean to say that the Web as a whole is a stagnant waste or that our information processing abilities are irrelevant. But with regard to certain basic human desires, we arrived about fifteen years ago when everybody got a PC, and everything since then has just been adding more features, giving you one more place where you can start a blog and one more list of contacts to keep synced.
[C] @articlepool:contacts,
[S] @articlelijeros:sex,
|
10 August 06. A time series analysis of Amazon sales rank
I have been very interested in the sales of Math You Can't Use: Patents, Copyright, and Software, a book with which I was heavily involved. (Amazon page) So naturally, I've been tracking the Amazon sales rank. At first, I did it the way everybody else does--refreshing the darn page every twenty minutes--but I have recently started doing it the civilized way--an automated script. Here is what I've learned about how Amazon does its rankings.
Background and conclusion
First, to give you some intuition as to sales rank, here's a little table:
How much more detail can we get? The answer: none, really. You'll see below that over the course of a few days, the ranking of a typical book will go from 50,000 to 500,000, and a minute later it will be back at 50,000. Thus, the sort of things we usually do with a ranking, like compare two books, are unstable to the point of uselessness.
One thing you evidently can do with the ranking is determine whether a
book has sold a copy in the last hour or two. As you'll see below,
there's a simple formula that will work for most books: if (current
rank) >
You can see from the graph that the pattern is a sudden jump and then a
slow drift downward. The clearest explanation is that the sales rank is
basically a function of last sale. When a copy sells, the book jumps to
a high rank, and then gets knocked down one unit every time
any lower-ranked book sells.
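The jump-then-drift pattern described above can be reproduced with a toy simulation. This is a minimal sketch under loud assumptions: a sale sends a book all the way to rank 1 (the post only says `a high rank'), and every book is equally likely to sell, which is certainly not true of real sales. The function name and parameters are hypothetical.

```python
import random

def simulate_last_sale_ranking(n_books=1000, n_sales=5000, watched=0, seed=1):
    """Rank books purely by most recent sale; record one book's rank after each sale."""
    rng = random.Random(seed)
    order = list(range(n_books))        # order[0] holds rank 1
    history = []
    for _ in range(n_sales):
        sold = rng.randrange(n_books)   # assumption: all books equally likely to sell
        order.remove(sold)
        order.insert(0, sold)           # assumption: a sale sends the book to the top
        history.append(order.index(watched) + 1)
    return history

history = simulate_last_sale_ranking()

# Between its own sales, the watched book either holds its rank (a book ranked
# above it sold) or slips by exactly one (a book ranked below it sold).
steps = [b - a for a, b in zip(history, history[1:])]
```

Plotting `history` against time gives a sawtooth: a sudden jump to the top on each sale, then a slow slide as lower-ranked books leapfrog over it.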
There are lots of details that those of us not working at Amazon will
never quite catch. There are periods (sometimes mid-day) when the
rank drifts down more slowly than it should, then speeds up in its
descent. This implies to me some computational approximations that
eventually get corrected. You'll notice that some of the books below show a
small slope upward (a ten or twenty point rise in ranking) from time to
time. When this happens, lots of books do it at once, also indicating
some sort of correction whose purpose or method I don't have enough
information to divine. Epstein and Axtell's book rises appreciably when
it nears half a million. Finally, I don't have enough data to determine
whether the ranking distinguishes between sales of used and new copies;
I don't think it does.
Here is a haphazard sampling of other books. Again, these are
dynamically regenerated every three hours, so come back later for more
action-packed graphing. Some of these books bear something in common
with Math You Can't Use, and others were based on a trip to
the used book store I'd made the other day. Some have hardcover and
paperback editions, in which case I just graph the paperback.
Epstein and Axtell's Growing Artificial Societies
[Amazon p.]
Andy Rathbone, Tivo for Dummies. I have no idea
who would buy this, and yet it is the best nonfiction seller here. This
proves that I must never go into marketing.
[Amazon p.]
Dickens's Great Expectations, Penguin Classics ed.
[Amazon p.] Books in the top 10,000 or so are selling several copies a day, so the pattern looks different.
Madonna's Sex. [Amazon p.]
Somebody ran into the used bookstore asking for a copy, and ran out when the owner said he didn't have one. It's amusing that a book from 1992 could still instill such fervor in a person.
It sells new for $125, used around $85.
Ian McEwan's Atonement.
[Amazon p.]
I really thought I'd hate this book,
since it starts off as being about subtle errors in manners committed by
a gathering of relatives and friends at a British country manor, but it
turned out to be an interesting modern take on the genre.
Those of us interested in the sales rank of books outside Oprah's picks
would be better served if the system were less volatile. In technical
terms, if my guess that the score experiences exponential decay is
correct, then the ranking system would be more useful to those of us
watching the long tail if the decay factor were set to a smaller value.
The data looks to me like an exponential decay system,
where you have a current score St
To fit this, I
flipped and renormalized the rankings so that one was the highest
possible ranking, and zero corresponded to a ranking of 500,000. Then, I
set the following algorithm:
As you can imagine, I found those constants via minimizing the distance
between the estimate and the actual. The algorithm
is an exponential decay model with
λ = 0.96
The green line shows the exponential decay model fit to the
actual data. You can decide if this is a good fit or a lousy one.
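For readers who want to play with the idea, here is a minimal sketch of an exponential decay score of the general kind described above. The update rule and constants actually fitted in this post, other than λ = 0.96, did not survive in this copy, so the jump rule, the `jump` value, and the function names below are hypothetical stand-ins, not the fitted model.

```python
def normalize_rank(rank, floor=500_000):
    """Flip and rescale a rank so rank 1 maps to 1.0 and `floor` maps to 0.0."""
    return max(0.0, (floor - rank) / (floor - 1))

def decay_scores(sales, decay=0.96, jump=0.9, start=0.0):
    """Score path: multiply by `decay` every period, then jump on a sale.

    `jump` and the max() rule are assumptions made for illustration only.
    """
    score, path = start, []
    for sold in sales:
        score *= decay                  # exponential decay with lambda = decay
        if sold:
            score = max(score, jump)    # hypothetical jump on a sale
        path.append(score)
    return path

sales = [True] + [False] * 9            # one sale, then nine quiet periods
path = decay_scores(sales)
```

A larger `decay` (closer to one) makes the score forget old sales more slowly; how that maps onto the decay factor discussed above depends on how the original model was parameterized, which this copy does not show.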
You can also have a look at how the model fit to
Madonna's book.
Usage:
This is pretty rudimentary; in the spirit of open source, I'd be happy
to post your improvements.
on Thursday, August 10th, techne said
Very fun. If only I had a book to track.
on Thursday, August 10th, Andy said
You can also use this formula to judge how Amazon's business is doing overall. When the book drifts down at a faster rate, that means that more books underneath it are selling; when it drifts down more slowly, fewer books below it are being sold.
on Thursday, August 10th, Miss ALS of San Diego, of course said
You're a geek.
on Sunday, August 20th, AC said
Heh. Neat.
on Sunday, March 25th, Mike said
I know why Tivo for Dummies sells -- I work at a call center for Directv, and get calls all the time like: "How do I record a show, How do I erase a show, how do I set up to record something regularly?". It really makes one sad for the future of society because more than half of these people have VCRs and can use them well. Just remember... It may be obvious, it may be EXACTLY THE SAME as something they ALREADY USE, but it ISN'T what they already use, so obviously their pre-existing knowledge is worthless. This holds true for web sites and computer programs as well.
|
10 September 06. The statistics style report
It may sound like an oxymoron, but there is such a thing as fashionable statistical analysis. Where did this come from? How is it that our tests for Truth, upon which all of science relies, can vacillate from season to season like hemlines?
Before answering that question, note that statistics as a whole is not arbitrary. The Central Limit Theorem is a mathematical theorem like any other, and if you believe the basic assumptions of mathematics, you have to believe the CLT. The CLT and developments therefrom were the basis of stats for a century or two there, from Gauss on up to the early 1900s when the whole system of distributions (Binomial, Bernoulli, Gaussian, t, chi-squared, Pareto) was pretty much tied up. Much of this, by the way, counts not as statistics but as probability.
Next, there's the problem of using these objective truths to describe reality. That is, there's the problem of writing models. Models are a human invention to describe nature in a human-friendly manner, and so are at the mercy of human trends. Allow me to share with you my arbitrary, unsupported, citation-free personal observations.
Number crunching
The first thread of trendiness is technology-driven. In every generation, there's a line you've got to draw and say `everything after this is computationally out of reach, so we're assuming it away', and the assume-it-away line drifts into the distance over time. Here's a little something from a 1939 stats textbook on fitting time trends:
To fit a trend by the freehand method draw a line through a graph of the data in such a way as to describe what appears to the eye to be the long period movement. ...The drawing of this line need not be strictly freehand but may be accomplished with the aid of transparent straight edge or a “French” curve.
As you can imagine, this advice does not appear in more recent stats texts. In this respect, a stats text can actually become obsolete. However, true and honest approximations like this are relatively rare. Instead, more computing power allows new paradigms that were before just written off as impossible.
Computational ability has brought about two revolutions in statistics.
The first is the linear projection (aka, regression). Running a
regression requires inverting a matrix, with dimension equal to the
number of variables in the regression. A two-by-two matrix is easy to
invert (ad - bc
So revolution number one, when computers first came out, was a shift
from simple correlations and analysis of variance and covariance to
linear regression. This was the dominant paradigm from when computers
became common until a few years ago.
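The link between regression and matrix inversion can be seen in the smallest possible case. Below is a minimal sketch, assuming a regression with just an intercept and one slope, so that the matrix to invert is two-by-two and the determinant is the familiar ad - bc. The function names and the data are invented for illustration.

```python
def invert_2x2(m):
    """Invert [[a, b], [c, d]] using the determinant ad - bc."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_line(xs, ys):
    """Least-squares intercept and slope via beta = (X'X)^-1 X'y."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    inv = invert_2x2([[n, sx], [sx, sxx]])   # X'X when the columns of X are (1, x)
    intercept = inv[0][0] * sy + inv[0][1] * sxy
    slope = inv[1][0] * sy + inv[1][1] * sxy
    return intercept, slope

# Made-up data lying exactly on y = 1 + 2x.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With ten regressors the same formula calls for inverting a ten-by-ten matrix, which is where the computer earns its keep.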
The second revolution was when computing power became adequate to do
searches for optima. Say that you have a simple function to take in
inputs and produce an output therefrom. Given your budget for inputs,
what mix of inputs maximizes the output? If you have the function in a form that
you can solve algebraically, then it's easy, but let us say that it is
somehow too complex to solve via Lagrange multipliers or what-have-you,
and you need to search for the optimal mix.
You've just
walked in on one of the great unsolved problems of modern computing. All
your computer can do is sample values from the function--if I try these
inputs, then I'll get this output--and if it takes a long time to
evaluate one of these samples, then the computer will want to use as few
samples as possible. So what is the method of sampling that will
find the optimum in as few samples as possible? There are many methods
to choose from, and selecting the best depends on enough factors that we call it
an art more than a science.
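As one concrete instance of sampling sparingly, here is a minimal sketch of golden-section search, a standard method that works only when there is a single input to vary and the function has a single peak. It is one of the many methods alluded to above, not a recommendation. The `output` function, which splits a budget of 10 between two inputs, is hypothetical.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-6):
    """Locate the maximizer of a single-peaked f on [lo, hi].

    Each pass shrinks the interval and reuses one earlier sample,
    so it costs only one new evaluation of f.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    evaluations = 2
    while b - a > tol:
        if fc > fd:                 # the peak lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                       # the peak lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
        evaluations += 1
    return (a + b) / 2, evaluations

def output(x, budget=10.0):
    """Hypothetical output from spending x on one input and the rest on another."""
    return x ** 0.3 * (budget - x) ** 0.7

best_x, n_samples = golden_section_max(output, 0.0, 10.0)
```

For this made-up function the algebraic answer is x = 3, and the search gets there with a few dozen samples; with many inputs or several peaks, no such tidy method exists.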
In the statistical context, the paradigm is to look at the
set of input parameters that will maximize the likelihood of the
observed outcome. To do this, you need to check the likelihood of every
observation, given your chosen parameters. For a linear regression, the
dimension of your task was equal to the number of regression parameters,
maybe five or ten; for a maximum likelihood calculation, the dimension
is related to the number of data points, maybe a thousand or a million.
Executive summary: the problem of searching for a likelihood function's
optimum is significantly more computationally intensive than running a
linear regression.
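Here is a minimal sketch of the maximum likelihood idea, assuming an Exponential model and a brute-force grid of candidate parameters; both choices are mine for the sake of a short example, as is the made-up data. Note that every candidate requires a pass over every observation.

```python
import math

def exponential_log_likelihood(rate, data):
    """Sum the log-density of every observation under an Exponential(rate) model."""
    return sum(math.log(rate) - rate * x for x in data)

def grid_search_mle(data, candidates):
    """Return the candidate rate with the highest log-likelihood."""
    return max(candidates, key=lambda r: exponential_log_likelihood(r, data))

data = [0.5, 1.5, 2.0, 0.25, 0.75]               # made-up observations
candidates = [k / 100 for k in range(1, 501)]    # rates 0.01, 0.02, ..., 5.00
rate_hat = grid_search_mle(data, candidates)

# For this simple model the optimum can also be solved algebraically,
# which gives a check on the search.
closed_form = len(data) / sum(data)
```

The grid here costs 500 candidates times 5 observations; a smarter search would use fewer candidates, but the cost per candidate still grows with the size of the data set.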
So it is no surprise that in the last twenty years, we've seen
the emergence of statistical models built on the process of finding an
optimum for some complex function. Most of the stuff below is a variant
on the search-the-space method. But why is the most likely parameter
favored over all others? There's the Cramer-Rao Lower Bound
and the Neyman-Pearson Lemma, but in the end it's just arbitrary. Gauss
had no theorems that this framework gives superior models relative to
linear projection, but it does make better use of computing technology.
Statistical modeling sees the same cycles, and the fluctuation here is
between the parsimony of having models that have few moving parts and
the descriptiveness of models that throw in parameters describing the
kitchen sink. In the past, parsimony won out on statistical models
because we had the technological constraint.
If you pick up a stats textbook from the 1950s, you'll see a huge number
of methods for dissecting covariance. The modern textbook will have a
few pages describing a Standard ANOVA (analysis of variance) Table, as
if there's only one. This is a full cycle from
simplicity to complexity and back again. Everybody was just too
overwhelmed by all those methods, and lost interest in them when linear
regression became cheap.
Along the linear projection thread, there's a new method introduced
every year to handle another variant of the standard model. E.g., last
season, all the cool kids were using the Arellano-Bond method on their
time series so they could assume away endogeneity problems. The list of
variants and tricks has filled many volumes. If somebody used every
applicable trick on a data set, the final work would be supremely
accurate--and a terrible model. The list of tricks balloons, while
the list of tricks used remains small or constant. Maximum likelihood
tricks are still legion, but I expect that the working list will soon find
itself pared down to a small set as optimum finding becomes standardized.
In the search-for-optima world, the latest trend has been in
`non-parametric' models. First, there has never been a term that deserved
air-quotes more than this. A `non-parametric' model searches for a
probability density that describes a data set. The set of densities
is of infinite dimension. If all you've got is a hundred data points, you
ain't gonna find a unique element of
ℜ^∞.
But `non-parametric' models allow you to have an arbitrary number of
parameters. Your best fit to a 100-point data set is a sum of 100 Normal
distributions. If you fit 100 points with 100 parameters,
everybody would laugh at you, but it's possible. In that respect, the
`non-parametric' setup falls on the descriptive end of the
descriptive-parsimonious scale. In my opinion.
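The sum-of-Normals fit is easy to write down. This is a minimal sketch of kernel density estimation with Normal kernels, a standard technique; reading it as the method meant above is an assumption, and the data and bandwidth are made up.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def kernel_density(data, bandwidth):
    """Return a density that averages one Normal bump centered on each data point."""
    def density(x):
        return sum(normal_pdf(x, point, bandwidth) for point in data) / len(data)
    return density

data = [1.0, 1.2, 3.5, 3.7, 3.9]     # made-up sample
density = kernel_density(data, bandwidth=0.3)

# Crude Riemann sum over [-5, 10] to check that the estimate integrates to about one.
step = 0.01
total = sum(density(-5 + i * step) * step for i in range(1500))
```

The fitted density has as many bumps as there are data points, each with its own location, so the parameter count grows with the data set rather than vanishing.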
I don't want to sound mean about `non-parametric' methods, by the
way. It's entirely valid to want to closely fit data, and I have used
the method myself. But I really think the name is false advertising.
How about distribution-fitting methods or optimal distribution
estimation?
Bayesian methods are increasingly cool. There is the computational
problem that if you want to assume something more interesting than
Normal priors and likelihoods, then you need a computer. That has been
surmounted, leaving us with the philosophy issues. In the context here,
those boil down to parsimony. Your posterior distribution may be even
weirder than a multi-humped sum of Normals, and the only way to describe
it may just be to draw the darn graph. Thus, Bayesian methods are also
a shift to the description-over-parsimony side.
Method of Moments estimators have also been hip lately. I frankly don't
know where that's going, because I don't know them very well.
Also, this
guy
really wants multilevel
modeling to be the Next Big Thing in the linear model world, and makes
a decent argument for that.
You can see that the increasing computational ability invites shifting
away from parsimony. Since PCs really hit the world of day-to-day stats
recently, we're in the midst of a swing toward description.
We can expect an eventual downtick toward simpler models, which will
be helped by the people who write stats packages--as opposed to the
researchers who caused the drift toward complexity--because they write
simple routines that implement these methods in the simplest way possible.
So is your stats textbook obsolete? It's probably less obsolete than
people will make it out to be. The basics of probability have not moved
since the Central Limit Theorems were solidified. In the end, once
you've picked your paradigm, there aren't really many methods out there
for truly and honestly cutting corners; most novelties are just about
doing detailed work regarding a certain type of data or set of assumptions.
Further, those linear projection methods or correlation tables work
pretty well for a lot of purposes.
But the fashionable models that are getting buzz shift every
year, and last year's model is often considered to be naïve or too
parsimonious or too cluttered or otherwise an indication that the author
is not down with the cool kids--and this can affect peer review
outcomes. A textbook that focuses on the sort of details that were
pressing ten years ago, instead of just summarizing them in a few pages,
will have to pass up on the detailed tricks the cool kids are coming up
with this season--which will in turn affect peer reviews for papers
written based on the textbook's advice. All this is entirely frustrating,
because we like to think that our science is searching for some sort of
true reflection of constant reality, yet the methods that are acceptable
for seeking out constant reality depend a bit more on human whim
than I'd really like.
on Monday, September 11th, Andy said
Interesting idea that methods as well as theories can go through paradigm shifts. But how do you know that this doesn't really represent progress? Regressions are more powerful than (i.e. a superset of) AN(C)OVA, and we ain't going back to those old days. So there is a competitive process of creative destruction, yadda yadda, until the best stats win. For example, take Huber-White robust standard errors. Or Dickens/Moulton style clustered standard errors. Nowadays people just use them without making a big deal about it, because they work. Maybe A-B will be in that same category ten years from now. Really, one problem I always have is figuring out how much the reader knows already and how much I should spell out.
on Monday, September 11th, Miss ALS of San Diego said
I think the truly interesting thing about shifts in methods is that until they are considered 'street wear' and not just haute couture, those using the latest thing have to examine (and, gasp, explain) their assumptions. People forget all about the requirements for OLS all_the_frickin_time...but people see OLS and they know how to interpret the results, so they don't bother figuring out whether OLS is appropriate. If you're using a Bayesian technique (which is also horribly named, i.m.o), you've got to convince people that your priors are reasonable, you've got to have a deeper understanding of the method because uber-human friendly programs like stata won't just chug it out for you.
|
26 October 06. Is Ruby halal?
The starting point here is last episode's essay on programming languages, and this here is basically an explanation and generalization of why I wrote it. For those who didn't read it (and I don't blame ya), here's a summary in the form of a description of my ideal girlfriend: she should be an Asian Jewess, around 172-174cm tall, gothy, sporty, significantly smarter than me, significantly cuter than me, significantly better socialized than me, willing to hang out with me, very well organized but endlessly spontaneous, enjoys walks along the beach, does intellectually challenging work that involves being outdoors, and plays guitar in a rock band. Yeah. So: too bad half of those things contradict the other half, eh.
The first key difference between the problem of picking a programming language and the problem of picking a significant other is that the programming language doesn't have to like you back. The liking-you-back issue creates many volumes' worth of interesting stories, all of which I will ignore here, in favor of the other key difference: unlike many girl/boyfriends, programs are often shared among friends and coworkers, meaning that there are externalities in my arbitrary, personal-preference choice. Personal preference plus externalities is the perfect recipe for never-ending, repetitive debate.
Debating the undebatable
Under Jewish law, one must never say the Name of God. In fact, there is none--it's sort of a mythical incantation, used to breathe life into Golems and otherwise tell monotheistic fairy tales. Under Islamic law, one must speak the Name of God when slaughtering an animal for the animal's flesh to be halal. My reading here is that there is therefore no way for meat to be both halal and kosher.
And let's note, by the way, that kosher and halal laws are not cast as rules about keeping clean for the sake of disease prevention. They're ethical laws, meaning that, like personal preference, they can't really be debated. It's not like somebody will finally find the correct answer and write it down for everybody to see. We can't even agree on basic axioms like `you should be nice to people' or `don't be wasteful'.
Do ethical laws induce externality problems? From the looks of it, yes they do, because so many people spend so much time trying to get other people to conform to their personal ethics. Ethics are an extreme form of that other personal preference, æsthetics, and seeing somebody commit what you consider to be an unethical act is often on par with watching somebody wearing a floppy brown sweater with spandex safety orange tights. Fortunately, almost everybody understands that there is no point going up to Mr. Brown-and-orange and telling him he needs to change, because we all know exactly how the conversation will go: some variant of `I have my own personal preferences' or `who are you to impose your arbitrary choices upon me'. That is, it would be a boring argument, because there is fundamentally no right answer.
When does human life begin? I have no idea, and anybody who says otherwise is guilty of hubris. Gee, that was a fun debate, wasn't it. And the problem with that non-debate, as with this essay, is that it has no emotionally satisfying conclusion.
The natural form of a debate is for one side to present its best arguments, the other side to present its own, and then both sides go home and think about it. But the form of debate that is emotionally satisfying has a resounding conclusion, where one side tearfully confesses to the other, `OK, I was wrong!' But with arguments of ethics or personal preference, this sort of resolution happens about once every never.
But there's a simple way to fix this problem: invent statistics. After all, not all debates are mere issues of personal preference. A question like `will building this road or starting this war improve the economy' has a definite answer, though we're typically not smart enough to know it. There are valid grounds for debate there. But for ethics and personal preference issues, we can still make it look like there are valid grounds for debate. Find out whether abortions decrease crime [The paper that claims this, by Steven “Freakonomics” Levitt and another not-famous economist, has been shown to be based on erroneous calculations. PDF], find out whether people commit more errors when commas are used as separators or terminators, run benchmarks, accuse the author of the file system you don't like of being a murderer. With enough haphazard facts, any debate about pure personal preference regarding simple trade-offs can be extended to years of tedium.
This turns debates that should be of the natural form (both sides state opinions, then go home) into the resounding form of debate, where both sides attempt to get the other side to tearfully confess the errors of its ways. But the sheen of facts doesn't change the fundamental nature of debates over ethics or personal preference, and because these are debates where nobody is actually wrong, nobody will ever be convinced to bring about an emotionally satisfying conclusion. We instead simply have a new variant on the recipe for tedious, never-ending debate.
|