Patterns in static

Some fluff, some info






28 August 04. I (heart) factor analysis!

OK, I've made the comment often enough that it's not fun anymore, but stats textbooks often blur the distinction between descriptive procedures and hypothesis testing procedures, and this creates endless woes in the world---notably, researchers are often taught to do exploratory stuff on data and then test the hypotheses that they'd just cooked up using the same data, which adds to human knowledge in some ways, but the hypothesis test is completely and totally invalid, with 100% certainty. Won't harp now; have blogged about this already.

There's something of a focus on hypothesis testing with most of this stuff since, again, it's the hypothesis test that gets you published. So this little column is to remind the reader of one of those things that gets left by the wayside: factor analysis (for my purposes here, interchangeable with its close cousin principal component analysis, which is usually computed via a singular value decomposition). It's more advanced than the basic descriptive stats (mean, variance), but not a hypothesis test, so in my own limited experience it tends to get skipped.

Before I lose every last one of you, here are two super-hip examples of why factor analysis is fun. The first is Poole & Rosenthal's analysis of the U.S. Congress. They found that about 90% of the variance in voting patterns can be described in two dimensions, so that can be made into a graph; string them together, and you have a movie of the U.S.'s political history.

Second example: Eharmony.com has a system to match you with someone you're most compatible with, and how do they do it? Factor analysis, of course. They give potential love birds a survey of a few hundred questions (oh, the tedium!) and then map the test-taker in a few artificial dimensions. They find people who are close in the artificial space, and have them go out on dates. Romance blossoms.

[It would be rude of me to encourage the reader to use factor analysis without mentioning that Eharmony has patented their matchmaking algorithm. Therefore, when you do your factor analysis, you must be careful that you don't assign variable names that sound too much like `sexual passion', `spirituality', or 27 other such terms---if you do, then you are in violation of the patent and could be sued for damages. As of this writing, most other names one could assign to the dimensions resulting from a factor analysis are still in the public domain, and can be used without a licensing fee.]

Oh, and my other favorite is a paper by Moore et al (citation available on request) about color perception: for sighted people, factor analysis neatly puts the perception of words such as `red', `blue', and `yellow' in two dimensions, in a circle---a color wheel. Factor analysis of responses to the same words by the blind falls on one dimension, basically ranging from bright to dark. And thus, factor analysis shows us what the blind see.

There are two ways you could go about it, I guess: the first is to say, `I have no idea how the data was generated, but darn it, I want a picture', in which case a two-dimensional factor analysis fits the bill wonderfully, and once you get the picture, you just might learn something (even if it can't be formally tested). The other is to say `I really think there are some latent variables driving the variables I've observed', and then factor analysis may again save the day, by showing you the best linear combination of existing variables to suggest what those latent variables may be. Both of these sorts of behavior are exactly what statistics is really about, and I think they're a great thing to try on any given data set.
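To make the `best linear combination' idea concrete, here is a minimal sketch of the principal-component flavor of the procedure, using nothing but the Python standard library. The data is made up (one hidden trait driving three noisy survey answers), and the function name and the power-iteration shortcut are my own choices for illustration, not anything from a particular stats package.

```python
import random

def principal_component(rows, iterations=200):
    """Return (direction, share of total variance) for the first principal component."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Sample covariance matrix of the observed variables.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalize,
    # which converges to the direction of greatest variance.
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    top_variance = sum(v[a] * cov[a][b] * v[b] for a in range(k) for b in range(k))
    total_variance = sum(cov[a][a] for a in range(k))
    return v, top_variance / total_variance

# Made-up data: one latent trait plus a little noise in each of three answers.
random.seed(1)
rows = []
for _ in range(500):
    latent = random.gauss(0, 1)
    rows.append([latent + random.gauss(0, 0.3) for _ in range(3)])

direction, share = principal_component(rows)
print(direction, share)
```

The three weights in `direction` come out nearly equal, and `share` says that a single artificial dimension soaks up most of the variance, which is about what you'd hope for when one latent variable really is driving everything.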

This all comes up because I have some data on how junior high kids smoke. I have zero information about the kids, not even their gender, since the guy who collected the data wants to play with it more and get some more publications out before he puts it out for others to use. So I'm not assuming that there are important latent variables underlying what data we have---I know that there are important missing variables underlying the data we have. The factor analysis did a very neat job of pulling out what I reasonably believe are the most important underlying characteristics, thus saving the day. [The hypothesis tests, by the way, are about ten steps further down the line in the project, so no, no worries about confounding the two. Won't bore you with the details.]

So don't forget to use factor analysis on your favorite data set, and maybe write a paper or two about the results, since good results from a good factor analysis really can teach the world stuff. Also, next time you're asked to review a paper, please remember that good descriptive stats can stand by themselves without being bolstered by a table of statistically sound but procedurally bogus linear regressions. After all, the true level of explanatory power and persuasiveness of a paper just isn't measured by confidence intervals.


on Wednesday, September 1st, April Petross said

hey Ben,
got your voice mail. Yes, I've been keeping up with your blog as well. I have been thinking about you a lot. apologize for not calling. Work is really stressful and I honestly don't know how late you are staying up these days. When is a good time to call? What is your primary email these days? Write and let me know. We (speaking for Jennifer here too) miss you.

xo.

Comment!
Yes, the comment box is tiny; write in a real text editor then just cut and paste here.
If you are a human, type the letter h in the first box.
h for human:
Name:
E-Mail:
Homepage:
 
 


12 February 05. Apophenia

Just a little note that I've finally put up my library of stats functions for public consumption and modification.


Before you all lose interest entirely, the name means a tendency to see patterns in static, which is the fundamental human tendency which statistics aims to combat. That is, the intent of good statistics, as far as I'm concerned, is not to uncover facts that we as humans couldn't work out ourselves, but to invalidate all of those claims that we come up with all day long but which are just an overactive imagination at work. Using statistics to uncover patterns we hadn't seen before is called `data mining', which used to be a dirty word, used to accuse those whose papers you didn't like, but management types have grown fond of it and now hire people to do it. I was once invited to a data mining conference.

But back to me. You may recall that a few months ago, before the software patent book, I'd written half a book about doing statistics in C. It received resistance because (1) I'm not a great writer and (2) it rejected the Universal Truth that all statistics must be done using a statistics package. One publisher, I recall, was rather explicit about (2).

But I still work in C, and I still do statistics, and there are other people out there just like me, I just know it. My desires are supremely simple: I just want a good, reliable toolbox. I'm not a visual person, and rarely do exploratory stuff in the way of just meandering through a data set. Usually, I have a specific question and want a nice, precise answer. That is, I want to apply a specific tool to the data.

The cute user interface of the typical stats package is also wasted on me---and it should be wasted on you too. If a reader asks nicely, you should be able to send him/her/it instructions for replicating your procedure from raw data to little stars after the t-statistic in your paper, and that means eschewing the clicky buttons for a written script.

From there, it's just a question of the language one wants to write the script in, and ya know, C is the language for me. Have already blathered about this, but here's the executive summary: I'm f.ing tired of learning new languages. I was feeling as though every time I start a new project, my former favorite language couldn't handle it. `Oh, you have limited dependent variables? Then use Limdep. Maximizing subject to constraints? That's what GAMS is for. You wanna do lots of matrix operations? Then switch back to Matlab. Except we don't have a license for it here.'

The library approach means never having to switch languages again: you'll need to find some C functions for the new trick you're trying to pull, but the ugly syntax is exactly the same, and the environment in which you program doesn't change, and if you had some cool functions in the last program that you want to reuse, you can call them directly. All that language-specific knowledge about what's easy or hard, and where you need to be careful to not mis-state things only builds from project to project. You just have to learn a sufficiently versatile language that can handle anything that may come forth, like C.

This project's contribution: a library of statistical functions at the same level as the stats packages. That is, a function which does OLS, a function which does factor analysis, et cetera. The lower-level stuff, about shifting matrices about and drawing from Gaussian distributions and querying the data, is handled by other libraries, so we don't have to worry about that.

So, dear reader, next time you find that you need to do a new statistical analysis, and your current language du jour doesn't work, give C a try. Maybe the library already has the function you need, and you're done. Maybe it doesn't, in which case you can exert the effort you would have taken to learn a new language and write the necessary function, but then you can contribute it for future use by the rest of us.
 
 


06 October 05. Why macroeconomics sucks

Or more specifically, why time series analyses are not to be trusted.


I've often mentioned here that time series analyses need to be eyed with great suspicion. Here, I give a detailed explanation why.

How it's supposed to work:

  • Write down a model of how variables influence each other.

  • Then, gather data to test the model, using a limited number of facts we know about the world (primarily central limit theorems).

  • To test, estimate parameters in the model; the facts about the real world give us some idea of the likelihood that the parameter we care about is not zero.


The worst macro models and time series screw up every single step in the above chain---sometimes many times over.

Write down a model

There are more data points than any of us could possibly count. However, there are only so many causal stories that we humans have been able to write down. Given a time series to explain, like GDP, and a hundred completely random variables, like sales of Barbies, beats-per-minute of Billboard's #1 song, Pantone numbers for the colors on Ikea's catalog, you are guaranteed with near-certainty that at least one of those variables has a strongly significant correlation with the variable to be explained. Here, apophenia kicks in, and after you've seen that BPM and GDP are correlated, you'll have no problem inventing a model for it.
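The near-certainty is simple arithmetic: if each unrelated variable has a 5% chance of looking significant by luck, the chance that at least one of a hundred does is 1 - 0.95^100, or about 99.4%. Here is a rough simulation of the same thing, assuming independent Gaussian noise for every series (real series that trend over time behave even worse). The cutoff of 0.279 is the correlation that corresponds to two-sided p < 0.05 with 50 observations; the series lengths and trial counts are arbitrary choices of mine.

```python
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

PERIODS = 50        # length of each made-up series
CANDIDATES = 100    # unrelated variables tried against the target
CRITICAL_R = 0.279  # |r| above this is 'significant' at the 5% level for 50 points

# Chance that at least one of the candidates clears the bar by luck alone.
analytic = 1 - 0.95 ** CANDIDATES

random.seed(7)
trials = 200
trials_with_a_hit = 0
for _ in range(trials):
    target = [random.gauss(0, 1) for _ in range(PERIODS)]
    hits = 0
    for _ in range(CANDIDATES):
        unrelated = [random.gauss(0, 1) for _ in range(PERIODS)]
        if abs(correlation(target, unrelated)) > CRITICAL_R:
            hits += 1
    if hits > 0:
        trials_with_a_hit += 1

simulated = trials_with_a_hit / trials
print(analytic, simulated)
```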

In short, the model has to come first, and has to be a serious attempt at explaining the world. But you knew that.

There's also the problem that it is very difficult to write down a model for which there isn't another model with the causation the other way `round. Timing won't necessarily save us: Christmas card sales cause Christmas, as the saying goes. But that's another deeper problem.


No, really, write down a model
Now, one problem with real data is that everything moves at once. Thus, as soon as you say 'A causes B', somebody will brusquely interject that no, C causes B, and A is just a bystander in it all. Therefore, macroeconomic papers often control for 'the usual variables'.

Reapplying the principle from the last section, don't write down a regression without a model to back it up—and having a model to back up half of it is as bad as having no model at all.

You no doubt felt that the section about how having a model is important is self-evident, and most serious macro papers will start off with a good model. But the statistics with which the model is estimated will almost always include spare variables which aren't in the model. With micro papers, it's a mixed bag; with macro papers, this is the norm.

Without a model to say anything about the extra variables, you've got a lot of leeway to screw around. If C, D, and E don't reshape B the way you want, try C squared, log of D, and D times E ('D interacted with E', as the lingo goes. Used in this manner, this term is meaningless. Does D times E appear in your model?). We throw out the parameters estimated for these extra control variables, and some take that to mean that we don't have to bother with a model for them. That alibi is the downfall of serious time series analysis, and covers a great deal of the empirical macro literature.

Using facts about the world
I used to think that stats is just an arbitrary list of customs that we made up so we have common means of arbitrating our disputes. But no, there are solid foundations to it.

Flip a coin a hundred times, write down the number of heads. Repeat the hundred-flip procedure a few hundred times. You now have a list of numbers between zero and a hundred, and you can plot their frequency. Most of the numbers you wrote down will be near fifty, and things will taper off as you get to the ends—a bell curve.

By which I do not just mean a curve which is fat in the middle and thin at the ends. I mean: p(x) = exp(-(x-50)^2/50)/sqrt(50*pi). This is the sort of precision we need to be able to say that one thing differs from another with 95.28% certainty. In fact, it's the sort of thing we need to say anything at all, because we only have a few coin flips, but we're trying to say something about what would happen with an indefinite number of future coin flips. Having a mathematical theorem about the probabilities in the limit means we don't have to take guesses.
Why is there the square root of pi in the denominator? One way to explain it is to point out that it integrates to one using a trick in polar coordinates. We're harking back to the relationship between exponentiation and trig on the complex plane that you slept through in high school.
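You can check the coin-flip story yourself. A hundred fair flips have mean 50 and variance 100 x 0.5 x 0.5 = 25, so the matching bell curve is the Normal density with those two numbers plugged in. The sketch below runs the hundred-flip procedure 5,000 times (my choice; the text says `a few hundred') and prints the observed frequency next to the density at a handful of points.

```python
import math
import random

def normal_density(x, mean=50.0, variance=25.0):
    """Normal density; the defaults match a count of heads in 100 fair flips."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

random.seed(3)
repeats = 5000
counts = [0] * 101  # counts[h] = number of runs that produced h heads
for _ in range(repeats):
    heads = sum(random.random() < 0.5 for _ in range(100))
    counts[heads] += 1

# Observed frequency versus the bell curve at a few points.
for heads in (40, 45, 50, 55, 60):
    print(heads, counts[heads] / repeats, round(normal_density(heads), 4))
```

The two columns should track each other closely, with the small gaps shrinking as `repeats` grows.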


For many other processes, there are other very specific things we can say about the resulting probability distribution. Statistics relies desperately on these few tricks we have at our disposal, generally known as the Central Limit Theorems (CLT). What do you do if the process you've modeled doesn't fit the assumptions of a CLT? Then you can't say how confident you are that the value for A that you drew differs from zero, because you don't know the distribution of A down to a square root of pi. If you draw another set of values for A, maybe it'll be distributed like the few that you've already drawn, but maybe it'll be totally different.

For example, for many types of game, if you have two players repeating the game a thousand times, the distribution of actions that player one took will have nothing at all to do with the distribution of actions that player two took, because the distributions are not independent---what player two does is directly related to what player one does. Experimental game theorists know this, and set the unit of one observation to be a whole run of the game. If you want enough data, don't have a hundred people play the game together and call it a hundred data points, but run the entire experiment with all new people a hundred times.

Back to time series. The model claims that the variable of interest is related to the various other things we wrote down, plus or minus an error term each period. As with the game playing above, the CLT will only apply when those error terms are independent and identically distributed. Independent: the error last period had nothing to do with the error this period; identically distributed: the distribution we're drawing the errors from doesn't change with time.

For a time series, these assumptions are untenable. It is very difficult to invent a story for what those error terms mean that reasonably fits the independence and identical distribution assumption.
Yes, I know you allowed for different variances via the var-covar matrix, but why am I supposed to believe you that the mean is constant or follows a linear trend that you can soak up with the coefficient on date?


What that means is that we can't apply the central limit theorems unless we make hefty assumptions about the world outside the model: everything in the world can be reduced to one variable, whose mean is constant, or at best, whose mean moves at a constant, linear step every period. That error is probably the mean of a hundred variables, many of which are moving up with time and many of which are moving down with time---no CLT on earth is going to tell you anything about what that series of errors will look like (even though there are CLTs to say something about the mean of unordered and independent draws from a hundred outside variables).
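Here is one way to watch the independence assumption fail, sketched under assumptions of my own choosing. Regress one series on a completely unrelated one and count how often the usual t-test declares the slope significant at the 5% level. When both series are independent noise, the test is fooled about 5% of the time, as advertised. When both are random walks, where each period's value is last period's value plus a shock, it is fooled most of the time. This is usually called the spurious regression problem.

```python
import random

def slope_t_statistic(xs, ys):
    """t-statistic for the slope in an ordinary least squares fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residual_ss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    standard_error = (residual_ss / (n - 2) / sxx) ** 0.5
    return slope / standard_error

def noise(n):
    """Independent, identically distributed draws."""
    return [random.gauss(0, 1) for _ in range(n)]

def random_walk(n):
    """Running sum of independent shocks, so each value depends on the last."""
    level, path = 0.0, []
    for shock in noise(n):
        level += shock
        path.append(level)
    return path

def rejection_rate(make_series, trials=1000, periods=100):
    """Share of trials where two unrelated series look significantly related."""
    rejections = 0
    for _ in range(trials):
        x, y = make_series(periods), make_series(periods)
        if abs(slope_t_statistic(x, y)) > 1.96:
            rejections += 1
    return rejections / trials

random.seed(11)
iid_rate = rejection_rate(noise)
walk_rate = rejection_rate(random_walk)
print(iid_rate, walk_rate)
```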

OK, one last attempt at explaining it: for any variable you gather over a hundred periods, I can find you a hundred other unrelated variables, somewhere, that sum up in some reasonable linear combination to exactly the data series you gathered. In many lab or micro settings, this isn't the case, but it holds for the vars of all macro time series studies. So if you include the hundred variables which replicate your pet variable, your pet variable will lose significance (especially since collinearity means your data matrix is now singular and (X'X)^-1 blows up). So you exclude the hundred variables—but now your error term is based on a process which replicates your variable, and IID goes up in flames again. Darned if ya do, darned if ya don't, and there's no mathematical formula to tell you how to select variables for inclusion or exclusion, so it's a crap shoot, which means the parameter estimates you get are a crap shoot. The sane thing to do would be to use the model, but the model doesn't mention but a few variables, and the other typical time-seriesesque variables just get included by fiat or custom.
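A toy version of the singular-matrix problem, with three made-up variables instead of a hundred: if one column of the data matrix is an exact linear combination of the others, the determinant of X'X is zero and there is no inverse to compute. The numbers and helper names below are invented for illustration.

```python
def gram_matrix(X):
    """Compute X'X for a data matrix stored as a list of rows."""
    k = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]

def determinant_3x3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

a = [1, 2, 3, 4, 5]
b = [2, 1, 4, 3, 6]
pet = [x + 2 * y for x, y in zip(a, b)]  # the pet variable, exactly a + 2b

collinear = [list(row) for row in zip(a, b, pet)]
print(determinant_3x3(gram_matrix(collinear)))    # 0: X'X cannot be inverted

unrelated = [1, 0, 2, 5, 3]  # a third column that is not a combination of a and b
independent = [list(row) for row in zip(a, b, unrelated)]
print(determinant_3x3(gram_matrix(independent)))  # positive, so the inverse exists
```

With real data the combination is rarely exact, so the determinant is merely tiny instead of zero, and the inverse exists but is wildly unstable.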


Sure, it happens often enough that the error term really is well behaved, and everything that isn't in the model has a neutral effect. It is often true here in the real world that A does cause B. But most statistical analyses in macroeconomic and time series studies do little or nothing to help ferret that out, because they need to make a dozen arbitrary outside-the-model assumptions for the statistical test of the model to work out. Without backing up the assumptions in mathematical results, they are as up for question as the model itself. Yeah, sometimes it's OK to fudge the assumptions (which are never 100% true to begin with), but it's nice to at least pretend to take them seriously.

Policy implications
Be Bayesian. When the paper says 'Therefore, the coefficient on A is significant with 99.99986% certainty', read that to mean 'there's one more piece of evidence that there might be something special about A.' With a hundred of `em, we can maybe start believing that A really is special. But any one time series, no matter the R^2 or the number of stars by the coefficients, can only provide a limited amount of evidence, because unless it is done right, it flouts too many of the fundamental assumptions underlying statistics.
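For the arithmetically inclined, here is a sketch of what `one more piece of evidence' could look like. Bayes' rule in odds form says posterior odds equal prior odds times a likelihood ratio. The numbers are pure assumptions on my part: a skeptical prior of 10% that A is special, and a modest likelihood ratio of 1.1 per study, meaning a significant result is only slightly more likely if A matters than if it doesn't. I also assume the studies are independent of each other, which is generous.

```python
def update(prior_probability, likelihood_ratio, studies):
    """Posterior probability after several independent studies, via Bayes' rule on odds."""
    odds = prior_probability / (1 - prior_probability)
    odds *= likelihood_ratio ** studies
    return odds / (1 + odds)

# Assumed numbers: 10% prior, each study nudges the odds up by a factor of 1.1.
for studies in (1, 10, 100):
    print(studies, round(update(0.10, 1.1, studies), 3))
```

One study barely moves the needle; a hundred of them move it a long way.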
 
 


11 April 06. Anti-intellectual


Pundit is a term from Hindi meaning “wise and learned man”, but it is usually used sarcastically in modern parlance. But, y'know, I don't feel so sarcastic about it. You can decide the “wise” part for yourself, but having spent a couple of years studying the narrow topic of subject matter expansion in patent law, I am confident describing myself as an authority. It's been months since I've heard a new argument on either side of the debate, and the new facts I'm learning are increasingly fine details. I don't feel any hubris when I say that nobody is going to blindside me on the tiny, narrow bit of subject that I have chosen for myself.

And ya know, most of the arguments that I have presented in various media and to various bigwigs over the last few months are arguments that numerous non-experts have also made.

I often run into people who divide academic results into two categories: (1) things anybody could have come up with after a bit of thought, and (2) things that are too esoteric to be worth anything. Some exceptions are made for chemists and engineers, whose work the commonsense folk have some sense is esoteric but will somehow eventually lead to new toys or a cure for something, but everybody else--the mathematicians who study tensors in R^14, the biologists who study odd tropical flora, and most importantly, the anthropologists and sociologists and economists who study people, whom we all study every day--are wasting their time and our money.

Findings
Nor is the righteous `my common sense trumps their PhDs' attitude restricted to the stereotypical hick. The back page of Harper's magazine, the page most magazines reserve for the humorous finale, is the Findings section, that lists a series of out-of-context study results. From the March 2006 issue: “...It was discovered that guppies experience menopause and that toxic waste in the Arctic was turning polar bears into hermaphrodites. ...A survey found that Americans are becoming less repulsed by the sight of obese people. Scientists launched a study to determine what sorts of clothing make a woman's bottom look too big. A study found that Americans are more miserable today than they were in 1991, and British researchers discovered that many young girls enjoy mutilating their Barbie dolls.”

OK, what are we to make of this? What message is being sent? Mashing together the studies means that the findings do not add up to any real image of the world, even if the page does categorize the findings for some sense of flow. Readers can't drop these tidbits into cocktail party conversation, because they only have one piece of information and so aren't armed for even the simplest follow-up. Interested readers can't learn more, because there are no citations. More importantly, there is no context: we are not given the reason for studying guppy reproductive systems, so we don't know why a scientist would care to do such a thing.

Being the back page, we know that it's supposed to be humorous, and with everything taken out of context, it can be, the way that so many statements out of context or in a different context are funny. But there's also the sense of laughing at the scientists. The subject of every sentence (but the passive-voice ones) is a researcher or a study or a survey. If the editors just wanted to list facts, they'd say “Americans are becoming less repulsed...” but instead they waste ink pointing out that “A study found that Americans are becoming less repulsed...”.

If there were an American Association Against Science, they would probably reprint the Findings page verbatim. The AAAS would ask, in big red letters, "Why are we spending money on this?" and the answer to why would not be anywhere to be found.

But you know that I spend all day studying obscure features of people's behavior and reading math books, so it's no surprise that I'm anti-anti-intellectual. It's no secret that if I had an anti-intellectual in the room here, I'd tell him or her (reading from Harper's again) “New data suggested that Uranus is more chaotic than was previously thought.”

[See, statements in a different context are downright hilarious!]

But it goes further than my kind of academic. The anti-intellectual sentiment--the insistence that it's either common sense or it's not worth the trouble--is a belief that there is no such thing as an expert. It is the myopic belief that if I don't know it, then there's nothing to know. As such, the anti-intellectual sentiment is often aimed at targets well far afield from intellectuals.

At the Baltimore Museum of Art, the same establishment that houses Picasso's Mother and Child, are such aggressively simple works of art as two silkscreen reprints of the Last Supper, and a curtain of blue and silver beads. Some readers will recognize the first as a work by Andy Warhol, and thus know the context: Mr. Warhol felt that the repetition and mutation of familiar images created new perspectives. For the second, as for a great deal of art that was clearly easy to execute, we don't know the context at all (1). But even though we don't know it, there is a context. The guy went to art school, has had a few focal ideas that drove all his work, and has done years of pieces that led to this simple bead curtain.

So what is an expert to do? One approach is to always stick to things that are obscure and look hard. Make sure that every study, every work of art, every essay says fuc* you, I'm an expert and you can't do what I do. But we value people who make it look effortless, whether they're figure skating, producing a painting, or running regressions. We always value simplicity, so if all it takes to get across the message is a curtain of beads, then why overcomplicate things to remind the viewer that it took years of work to get there? Some of the best guitarists out there never really ventured past four chords, while the guys who can play intricate solos are often dubbed wankers.

I'm glad I wrote my PhD thesis, and more generally love the idea of a thesis, including for high school seniors, BAs, or anywhere in between. A good thesis means that the author has become an expert in some tiny, irrelevant little corner of the world. Research ability by itself is valuable, and it's good practice for when the student needs to be an authority in something of more practical value, but it also gives the student an idea of what the other experts of the world have gone through to get to their simple ends. Remember that part in Zoo Story where the guy says that “sometimes it's necessary to go a long distance out of the way in order to come back a short distance correctly”? A student who has gone a long way in becoming an expert, and must then reduce that to the sort of ten second summaries that we all give to friends and family, will have a better understanding of the long distance that other experts have gone before they could string together simple words or beads or chords.


Footnotes

(1) Sorry, I can't help the art snobs in the audience with the guy's name. Enjoy being in the dark with me here.



 
 


14 May 06. The Web as human network


I'd like to discuss the question of how technology has changed personal relations. That'll come next time. For now, let's look at a specific, vaguely related question: does the link structure of the Net mirror the link structure of human networks?

Back when Alta Vista was the highest view in Internet search, a few IBM and Alta Vista researchers did a rather detailed study of the Web's structure (1). They, as with many others, found that the distribution of links on the Net looked a lot like the distribution of human links. There is a power law distribution where there are a few sites that are linked endlessly, and a long tail of sites that only have a few links.


Figure One: Junior high class photo. That's me on the far right.

To give an example of a power law, here is a graph based on data from junior high classes. The most popular student is on the X-axis at the far left (at X=0), and was nominated as a best friend by a mean of 9.75 other students (over 88 classrooms in the sample). Over on the other end of the X axis, the 25th through 35th ranked student in the classroom was nominated as a best friend by a mean of less than one other student. So you've got a few very well-connected students and a lot of students who have no connections at all.

We see this pattern in social networks of all scales, and among Web pages. The nomination count graph is typically a little more curvy than this one, with even more of a steep slope down from the most popular members of the group and a longer tail at the other end.

It sounds like the WWW as interpersonal network metaphor is working OK, but two caveats: first, there is much debate as to whether the best fit for the link distribution of the Web is a Negative Exponential, a Gamma, a Zipf, or a variety of other distributions that all look identical to a non-expert. Unless you hope to study this stuff seriously, you don't have to care about this caveat and can just call it a power law. The best fit to the student data is a Gamma distribution, by the way.

Second, human networks are pretty symmetric, in that there are few face-to-face contacts where one party is ignorant of the other. The exception is celebrities, whom we know but who don't know us, but we can throw those out and have a reasonably symmetric set of acquaintance links. The popular kids may not want to hang out with the unpopular ones, but they know them nonetheless. But with Web pages, it happens all the time that a page gives no indication of what other pages are linking to it.


Figure Two: The Insidious Bowtie of Nyrothænim, aka The Internet.

Broder et al found that this asymmetry occurs on a grand scale. They divide the Web into a giant Strongly Connected Component (SCC) comprising about a quarter of the Web; these are sites that interlink with each other. Then there's a quarter that only links in to the SCC but does not receive links. That would be blogs from losers like me. Then there's a quarter that is linked from the SCC but does not link to anything in particular, comprising corporate sites that just go in internal circles and things like online books and manual pages that are informative but not filled with links. The final quarter, they called `tendrils', indicating a trail of limited links that doesn't readily fall into the first three categories. Thus, because a web page is not a person, the symmetry of human networks does not map to web links.

Another important distinction is that the whole small world game, where we try to find a chain of people from a guy in Katmandu to a guy in Omaha, does not work for the Net, because if you start on the right side of the bowtie, you cannot get to the left side. For humans, you can almost certainly find a chain, and it'll be well under ten people in almost all cases; for the Net, you only have about a 25% chance of being able to form a chain from any randomly selected site to any other randomly selected site. E.g., try getting from This haphazard site in Canada to this site here (hint: you can't). When you can form a chain, say from the in-feeding region to the SCC region, then it can still be hundreds of nodes long if one element is well-buried in a subculture.
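The one-way nature of the bowtie is easy to see on a toy example. Below is a hypothetical eight-page web of my own invention (two feeder pages, a three-page core that links in a circle, two sink pages, and one tendril hanging off a feeder), with a breadth-first search counting how many ordered pairs of pages can be joined by a chain of links.

```python
from collections import deque

# Hypothetical eight-page web; each page maps to the pages it links to.
links = {
    "in1": ["core1", "tendril"],
    "in2": ["core2"],
    "core1": ["core2"],
    "core2": ["core3"],
    "core3": ["core1", "out1"],
    "out1": ["out2"],
    "out2": [],
    "tendril": [],
}

def reachable_from(start):
    """Every page you can get to from start by following links (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for target in links[page]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start)
    return seen

pairs = [(a, b) for a in links for b in links if a != b]
connected = [(a, b) for a, b in pairs if b in reachable_from(a)]
print(len(connected), "of", len(pairs), "ordered pairs have a chain")
print("out2" in reachable_from("in1"), "in1" in reachable_from("out2"))
```

In this little web, 24 of the 56 ordered pairs can be chained, and you can surf from `in1` to `out2` but never back.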

Now, with human networks, we can distinguish between acquaintance, which is almost by definition symmetric, and friends, which is depressingly unidirectional, typically from low-status to high-status. I don't believe this metaphor is particularly well-studied, but it doesn't work very well. The net receivers of links for the Net are not high-status pages, but pages that just provide information (corporate, technical, whatever).

But getting back to the part of the metaphor that does work, there are two characteristics to both networks. First, there's a cost to linking both socially and online, because you need to find the subject of your interest and know them. Second, there is a cost to searching for new links. An immediate corollary to expensive search is a principle that the rich get richer: the easiest way to find new links for your own personal address book is to ask others for their contacts, so well-linked people/sites are more likely to get more links.
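The rich-get-richer principle is enough by itself to produce a few hubs and a long tail. Here is a minimal sketch, assuming the simplest rule I could think of: each new page makes exactly one link, and picks its target with probability proportional to how many links the target already has. This rule usually goes by the name preferential attachment; the page count and the one-link-per-page restriction are arbitrary simplifications.

```python
import random

random.seed(5)
link_counts = [1, 1]  # start with two pages linked to each other
endpoints = [0, 1]    # each page appears here once per link it takes part in
for new_page in range(2, 2000):
    # A uniform draw from endpoints picks a page in proportion to its link count.
    target = random.choice(endpoints)
    link_counts[target] += 1
    link_counts.append(1)  # the newcomer starts with its single link
    endpoints.extend([target, new_page])

ranked = sorted(link_counts, reverse=True)
print(ranked[:5], ranked[len(ranked) // 2])
```

The first list is the link counts of the five best-connected pages, and the last number is the count for the median page, which stays stuck at the bottom while the early arrivals keep collecting links.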

More on this next time.

(1) @article{broder:net,
title = "Graph Structure in the Web",
author = "Andrei Broder and Ravi Kumar and Farzin Maghoul and Prabhakar Raghavan and Sridhar Rajagopalan and Raymie Stata and Andrew Tomkins and Janet Wiener",
journal = "Computer Networks",
volume = 33,
year = 2000,
pages = "309--320"
}




on Thursday, May 18th, L-San Diego said

when did you put the h for human box in? what's it mean? i'm very fascinated about this.

please explain.

Oh, got it.

I'm not a machine! I'm a man!!!!!



26 May 06. Invariants


This is about two technological revolutions that didn't happen, and aren't going to happen any time soon.

To some extent, this is also about a recent revolution in economics, where the study of how people interact has shown that there ain't nearly as much variation as we'd thought before: what we thought was wide variety is actually just a combination of invariants. More generally, it's a result of computational progress that has allowed us to pay more attention to distributions that are not in the Gaussian family (binomial, Normal, t, F, chi-squared) like the exponential, Poisson, Zipf, &c.

The problem is that we humans have limits, and they have not in any way changed thanks to technology. The key limits are time and memory.

Who here bought R.E.M.'s Out of Time on vinyl or cassette?

The first result of these limits is the size of our comprehensible network. That is, how many people do I know well enough that I could hold a friendly conversation with them?

We can connect faster via cellular telephones, email, ntalk, or whatever point-and-talk technology has emerged since I wrote this, and so the time spent connecting is shorter, and we can cheaply connect to more distant people. But once the connection is made, we still have to resort to just talking or writing as before. This takes time, and the new toys don't speed this up at all.

Sure, you've got Friendster (or whatever the cool kids are using these days) allowing you to browse through photos of your pals, but back in the day, you had a paper address book, with scraps of everything hanging out of it, that let you do the same thing.

de Sola Pool and Kochen [C] made various attempts at estimating the number of acquaintances that a person has, and found that folks generally have about 1,500 immediate acquaintances whom they will see over the next two months once or twice and say hi to, and then about 4,500 less direct acquaintances, like the people from college whom they'll only see every few years. Perhaps our online networks have sort of blurred the lines on the close-by acquaintances and distant acquaintances, but how many hundreds of your high school pals have emailed you lately?

But that's all scale: what about structure? Are our social hierarchies flatter and more egalitarian now that we've got the Net? Again, no. We still see the same sort of pattern we saw in last episode: a few people who are very well connected and a lot of people who are minimally connected. The debate (about which I am no authority) is whether this is because some people have a higher capacity to maintain pals, due to more time dedicated to it or an innate name-and-face memory; or because of a rich-get-richer story that people find new pals via their old pals, so those who are well-networked will only wind up better-networked in the future. The true story is no doubt a bit of both.

Costly maintenance of links and costly search for new links have not changed for us humans. Generally, if you've got both of those characteristics, you're going to have a network that looks like standard social networks, and if those limits are set by the human brain and our 24 hour day, then the scale of those networks is set.

Content
Moving on from social networks, the second limit is in what we can produce. If you spent every minute of the next year typing away at your keyboard, your computer's hard drive would barely notice it all. [1 word= about 6 bytes. Given 60 words per minute times 1440 minutes per day = 518,400 bytes/day; in a year that's 180MB.] For most of us, everything we ever wrote would easily fit onto a single CD. That is, the technology of text processing has blown past the human ability to produce text.
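The bracketed arithmetic is easy to check. The sketch below just redoes it in Python with the essay's own rough figures; the 700 MiB capacity for a CD is my assumption, as are the variable names.

```python
BYTES_PER_WORD = 6          # the essay's rough figure
WORDS_PER_MINUTE = 60
MINUTES_PER_DAY = 24 * 60   # typing every minute of the day

bytes_per_day = BYTES_PER_WORD * WORDS_PER_MINUTE * MINUTES_PER_DAY
bytes_per_year = bytes_per_day * 365
mib_per_year = bytes_per_year / 2**20

# Assumption: a CD holds about 700 MiB.
years_to_fill_cd = 700 * 2**20 / bytes_per_year

print(bytes_per_day)         # 518400
print(round(mib_per_year))   # 180
print(years_to_fill_cd)
```

So even at this inhuman pace, with no sleep and no pauses to think, it would take years of typing to fill one disc.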

For music and still pictures, we're in about the same place. The roadblock is not in storage and transmission, but in the process of finding artistic inspiration and the time and skill needed to execute it. Moving pictures are not far behind, and twenty years from now, downloading a movie won't take a moment's thought by anybody. Nobody will worry about the price of film stock, but the process of writing and producing a movie will still be a massive effort.

On the consumption side, it still takes 70 minutes to listen to Beethoven's Ninth, though you no longer have to get up and flip the disc in the middle. It still takes 90 minutes to watch a ninety-minute movie. The `read any day now' pile of articles on my hard drive has certainly grown, but the `articles I've read' pile grows at the slow, steady pace it always has, and the `articles I remember reading' pile continues to wither.

So scale is again set. As for structure, we find that there is again the same power-law type distribution in consumption. If we plot sales and Amazon sales rank on a log-log scale, we find that it's linear. In other words, the top ten best-selling books sell ten times as much as the bottom of the top 100, and those sell ten times as many as the bottom of the top 1,000, and so on down into the millions. [Below the top sellers, by the way, the ranking is basically the order of last sale.] That is, content is another power law, and that structure doesn't change with onlineness: before millions of blogs only read by three people, there were `zines only read by three people, and before that, letters.
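To see why a straight line on a log-log plot gives the ten-times-per-decade story, here is a small sketch. The sales function is hypothetical (a pure power law with exponent one and a made-up scale, not Amazon data); the point is only that such a function has a log-log slope of minus one, and that moving from rank 10 to rank 100 divides sales by ten.

```python
import math

def sales(rank, scale=1000000.0, exponent=1.0):
    """Hypothetical power law: sales fall off as rank**(-exponent)."""
    return scale * rank ** (-exponent)

def loglog_slope(ranks, values):
    """Least-squares slope of log(value) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

ranks = [10, 100, 1000, 10000, 100000]
values = [sales(r) for r in ranks]
slope = loglog_slope(ranks, values)

print(slope)                    # about -1
print(sales(10) / sales(100))   # about 10
```

With real sales figures the same slope calculation would give an estimate of the exponent, which need not be exactly one.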

So the distribution of book popularity happens to match the distribution of people popularity, which is no surprise, because the same two problems--costly search and costly linking/consumption--are an issue in both cases.

Policy implications
We are all more-or-less as networked as we're going to be by maybe age sixteen [socially; sexual networks follow different patterns from social networks, and tend to take more of a rich-get-richer form.[S]]. When you meet somebody new, they're crowding out somebody else, as time spent cultivating your new pal is not time spent cultivating the old. The same works for entire networks: just as advertisers must compete for your few dollars, networks must compete for your limited networking resources. Similarly, having a wealth of new content available just means that we have a wealth of things that we'll never read because they're crowded out by the other things we're reading.

I don't mean to say that the Web as a whole is a stagnant waste or that our information processing abilities are irrelevant. But with regards to certain basic human desires, we arrived about fifteen years ago when everybody got a PC, and everything since then has just been adding more features, giving you one more place where you can start a blog and one more list of contacts to keep synced.

[C] @article{pool:contacts,
author = "de Sola Pool, Ithiel and Manfred Kochen",
title = "Contacts and Influence",
journal = "Social Networks",
volume = 1,
number = 1,
pages = "5--51",
year = "1978/79"
}

[S] @article{lijeros:sex,
title = "The Web of Human Sexual Contacts",
author = "Fredrik Liljeros and Christofer R Edling and Luís A Nunes Amaral and H Eugene Stanley and Yvonne Åberg",
journal = "Nature",
volume = 411, day = 21, month = June, year = 2001,
pages = "907--908"
}






10 August 06. A time series analysis of Amazon sales rank


I have been very interested in the sales of Math You Can't Use: Patents, Copyright, and Software, a book with which I was heavily involved. (Amazon page)

So naturally, I've been tracking the Amazon sales rank. At first, I did it the way everybody else does--refreshing the darn page every twenty minutes--but I have recently started doing it the civilized way--an automated script. Here is what I've learned about how Amazon does its rankings.

Background and conclusion
First, to give you some intuition as to sales rank, here's a little table:

Sales rank       What you'll find there
1-10             Oprah's latest picks
10-100           The NYT's picks
100-1,000        Books by editors of Wired Magazine,
                 topical rants by pundits/journalists,
                 `classics'
1,000-500,000    everything else (still selling)
500,000-2mil     everything else (technically in stock)

How much more detail can we get? The answer: none, really. You'll see below that over the course of a few days, the ranking of a typical book will go from 50,000 to 500,000, and a minute later it will be back at 50,000. Thus, the sort of things we usually do with a ranking, like compare two books, are unstable to the point of uselessness.

One thing you evidently can do with the ranking is determine whether a book has sold a copy in the last hour or two. As you'll see below, there's a simple formula that will work for most books: if the current rank is better than the earlier rank (that is, the rank number has gotten smaller), then there was a recent sale.
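As a sketch of that rule in Python: given a list of rank readings taken at regular intervals, flag every reading where the rank number dropped, since a smaller number is a better rank. The readings here are invented for illustration, and the function name is mine.

```python
def sale_times(ranks):
    """Indices where the rank number dropped (improved) since the last reading."""
    return [i for i in range(1, len(ranks)) if ranks[i] < ranks[i - 1]]

# Hypothetical readings, one every three hours.
readings = [210000, 230000, 251000, 48000, 61000, 75000, 52000]
print(sale_times(readings))   # [3, 6]
```

This only says that at least one copy sold between two readings; it can't count copies, and the small corrective upticks described later in the essay would show up as false positives.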

The first chart
Here is a graph of sales of Math You Can't Use. A data point is added every three hours, so if you come back in a week, this graph will be different. See below for the code I used to generate this. On the x-axis is the date, and on the y-axis is the Amazon sales rank at that time.

[Figure: Sales rank for Klemens's Math You Can't Use]

You can see from the graph that the pattern is a sudden jump and then a slow drift downward. The clearest explanation is that the sales rank is basically a function of last sale. When a copy sells, the book jumps to a high rank, and then gets knocked down one unit every time any lower-ranked book sells.

There are lots of details that those of us not working at Amazon will never quite catch. There are periods (sometimes mid-day) when the rank drifts down more slowly than it should, then speeds up in its descent. This implies to me some computational approximations that eventually get corrected. You'll notice that some of the books below show a small slope upward (a ten or twenty point rise in ranking) from time to time. When this happens, lots of books do it at once, also indicating some sort of correction whose purpose or method I don't have enough information to divine. Epstein and Axtell's book rises appreciably when it nears half a million. Finally, I don't have enough data to determine whether the ranking distinguishes between sales of used and new copies; I don't think it does.

Others

Here is a haphazard sampling of other books. Again, these are dynamically regenerated every three hours, so come back later for more action-packed graphing. Some of these books bear something in common with Math You Can't Use, and others were based on a trip to the used book store I'd made the other day. Some have hardcover and paperback editions, in which case I just graph the paperback.

Epstein and Axtell's Growing Artificial Societies [Amazon p.]

[Figure: Sales rank for Epstein and Axtell's Growing Artificial Societies]

Andy Rathbone, Tivo for Dummies. I have no idea who would buy this, and yet it is the best nonfiction seller here. This proves that I must never go into marketing. [Amazon p.]

[Figure: Sales rank for Rathbone's Tivo for Dummies]

Dickens's Great Expectations, Penguin Classics ed. [Amazon p.] Books in the top 10,000 or so are selling several copies a day, so the pattern looks different.

[Figure: Sales rank for Great Expectations]

Madonna's Sex. [Amazon p.] Somebody ran into the used bookstore asking for a copy, and ran out when the owner said he didn't have one. It's amusing that a book from 1992 could still instill such fervor in a person. It sells new for $125, used around $85.

[Figure: Sales rank for Madonna's Sex]

Ian McEwan's Atonement. [Amazon p.] I really thought I'd hate this book, since it starts off as being about subtle errors in manners committed by a gathering of relatives and friends at a British country manor, but it turned out to be an interesting modern take on the genre.

[Figure: Sales rank for McEwan's Atonement]

Executive summary
At this point, I'm not sure why Amazon ranks books below the top thousand, except for a sort of geek factor. For all of the books here, it is basically impossible to say something like `Tivo for Dummies is ranked around 100,000,' since the ranking jumps by an order of magnitude almost daily. Similarly, there's no point saying `Atonement is ranked higher than Great Expectations', since you have a 50-50 chance of being wrong tomorrow. All we get is a very broad ballpark figure (a football field figure?), and a too-good impression of how many hours ago the last sale was made.

Those of us interested in the sales rank of books outside Oprah's picks would be better served if the system were less volatile. In technical terms, if my guess that the score experiences exponential decay is correct, then the ranking system would be more useful to those of us watching the long tail if the score decayed more slowly.

Technical notes

The data looks to me like an exponential decay system, where you have a current score $S_t$ which goes up by some amount every sale, but drifts down by some discount rate every period, $S_{t+1} = \lambda S_t$. [Thus, if there were no sales events, your score would be $S_t = S_0 \lambda^t = S_0 \exp(t \ln \lambda)$, which shrinks over time because $\lambda < 1$.]

To fit this, I flipped and renormalized the rankings so that one was the highest possible ranking, and zero corresponded to a ranking of 500,000. Then, I set the following algorithm:
The score was initialized at 0.58.
Each period, score is multiplied (shrinks) by a factor of 0.96.
If there is a sale, then score rises by the addition of (1-current score) * 0.79.

As you can imagine, I found those constants via minimizing the distance between the estimate and the actual. The algorithm is an exponential decay model with λ = 0.96 , and upward shocks as described. The only way I could fit the data was to make shocks when the book is at a low sales rank bigger than shocks when it has a high sales rank. There's surely a more clever way to do it.
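The three-step algorithm is short enough to write down. What follows is a sketch, not the fitting code itself: the constants are the ones quoted above, but the exact rescaling from rank to score and the order of operations within a period (decay first, then the shock) are my assumptions, as are the function names.

```python
def rank_to_score(rank, floor=500000):
    """Flip and rescale a rank: near 1 for the top rank, 0 at rank `floor`.
    The exact rescaling is an assumption; the text does not spell it out."""
    return 1.0 - rank / floor

def simulate(sales, init=0.58, decay=0.96, shock=0.79):
    """sales[t] is True if a copy sold in period t; returns the score path."""
    score = init
    path = [score]
    for sold in sales:
        score *= decay                       # drift down every period
        if sold:
            score += (1.0 - score) * shock   # a lower score gets a bigger jump
        path.append(score)
    return path

# Hypothetical sales history: quiet, quiet, one sale, then quiet again.
path = simulate([False, False, True, False, False])
print(["%.3f" % s for s in path])
```

Because the shock is proportional to (1 - score), the score can never pass one, and a book sitting near zero jumps much further on a sale than a book already near the top, which is the asymmetry described above.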

The green line shows the exponential decay model fit to the actual data. You can decide if this is a good fit or a lousy one.

[Figure: My attempts to fit the Amazon sales rank to an exponential model. The sales rank data and a line fitting to the data. The line curves more than the real data, and its jumps are typically not as high.]

You can also have a look at how the model fit to Madonna's book.

The code
For the geeks with their own stuff to track, here's my code. You can see that it's Python glue holding together a number of tools that I take to be common POSIX tools, like wget and a copy of Gnuplot recent enough to render the PNG format. The main loop (while (1)) just runs the checkrank function, writes the time and rank to a file, and then calls Gnuplot to plot the file. The checkrank function downloads the book's page, searches for the phrase `Amazon.com Sales Rank: ###' and returns the ### part.

Usage:

import amazon   #assuming you've saved the file below as amazon.py.
amazon.onebookloop("03030303", "output_graph.png")
Notice that the ASIN should be a text string: if it were a number, an initial zero would often be dropped, and some ASINs end in an X.

This is pretty rudimentary; in the spirit of open source, I'd be happy to post your improvements.

#!/usr/bin/env python3
#(c)2006 Eric Blair. Licensed under the GNU GPL v 2.
import re, os, subprocess, time

def checkrank(asin):
    """Fetch the book's page with wget; return the rank as a string, or None."""
    page = subprocess.run(
        ["wget", "-q", "-O", "-", "http://amazon.com/gp/product/%s/" % (asin,)],
        stdout=subprocess.PIPE, encoding="utf-8", errors="replace").stdout
    exp = re.compile(r'Amazon\.com Sales Rank:</b>  #([^ ]*) in Books')
    result = exp.search(page)
    if result is not None:
        return result.groups()[0]
    else:
        return None

def onebookloop(asin, outfile):
    datafile = "rankings.%s" % (asin,)
    #The Gnuplot commands go at the top of the data file, because
    #plot '-' reads its data from the lines that follow on stdin.
    if not os.path.isfile(datafile):
        f = open(datafile, 'w')
        f.write("""set term png
set xdata time
set timefmt "%Y; %m; %d; %H;"
set yrange [1:*] reverse
plot '-' using 1:5 title "sales rank"
""")
        f.close()
    while (1):
        t = time.localtime()
        r = None
        while r is None:    #retry every ten seconds until the page parses
            r = checkrank(asin)
            if r is None:
                time.sleep(10)
        f = open(datafile, 'a')
        f.write("%i; %i; %i; %i; %s\n" % (t.tm_year, t.tm_mon, t.tm_mday, t.tm_hour, r))
        f.close()
        #strip the commas from the ranks, then let Gnuplot draw the PNG
        subprocess.call("sed -e 's/,//g' < %s | gnuplot > %s" % (datafile, outfile), shell=True)
        time.sleep(3*60*60) #3 hours.





on Thursday, August 10th, techne said

Very fun. If only I had a book to track.

on Thursday, August 10th, Andy said

You can also use this formula to judge how Amazon's business is doing overall. When the book drifts down at a faster rate, that means that more books underneath it are selling; when it drifts down more slowly, fewer books below it are being sold.

on Thursday, August 10th, Miss ALS of San Diego, of course said

You're a geek.

I know why Tivo for Dummies sells--how funny would that be as a gift Christmas morning with your tivo? Answer? Funny. Very very funny. We had a good laugh at the bookstore as I recall.

on Sunday, August 20th, AC said

Heh. Neat.

on Sunday, March 25th, Mike said

I know why Tivo for Dummies sells -- I work at a call center for Directv, and get calls all the time like: "How do I record a show, How do I erase a show, how do I set up to record something regularly?". It really makes one sad for the future of society because more than half of these people have VCRs and can use them well. Just remember... It may be obvious, it may be EXACTLY THE SAME as something they ALREADY USE, but it ISN'T what they already use, so obviously their pre-existing knowledge is worthless. This holds true for web sites and computer programs as well.



10 September 06. The statistics style report


It may sound like an oxymoron, but there is such a thing as fashionable statistical analysis. Where did this come from? How is it that our tests for Truth, upon which all of science relies, can vacillate from season to season like hemlines?

Before answering that question, note that statistics as a whole is not arbitrary. The Central Limit Theorem is a mathematical theorem like any other, and if you believe the basic assumptions of mathematics, you have to believe the CLT. The CLT and developments therefrom were the basis of stats for a century or two there, from Gauss on up to the early 1900s when the whole system of distributions (Binomial, Bernoulli, Gaussian, t, chi-squared, Pareto) was pretty much tied up. Much of this, by the way, counts not as statistics but as probability.

Next, there's the problem of using these objective truths to describe reality. That is, there's the problem of writing models. Models are a human invention to describe nature in a human-friendly manner, and so are at the mercy of human trends. Allow me to share with you my arbitrary, unsupported, citation-free personal observations.

Number crunching
The first thread of trendiness is technology-driven. In every generation, there's a line you've got to draw and say `everything after this is computationally out of reach, so we're assuming it away', and the assume-it-away line drifts into the distance over time. Here's a little something from a 1939 stats textbook on fitting time trends:

To fit a trend by the freehand method draw a line through a graph of the data in such a way as to describe what appears to the eye to be the long period movement. ...The drawing of this line need not be strictly freehand but may be accomplished with the aid of transparent straight edge or a “French” curve.

As you can imagine, this advice does not appear in more recent stats texts. In this respect, a stats text can actually become obsolete. However, true and honest approximations like this are relatively rare. Instead, more computing power allows new paradigms that were before just written off as impossible.

Computational ability has brought about two revolutions in statistics. The first is the linear projection (aka, regression). Running a regression requires inverting a matrix, with dimension equal to the number of variables in the regression. A two-by-two matrix is easy to invert ($ad - bc$, remember?) but it gets significantly more computationally difficult as the number of variables rises. If you want to run a ten-variable regression using a hand calculator, you'll need to set aside a few days to do the matrix inversion. My laptop will do the work in 0.002 seconds. It's still in under a second up to about 500 by 500, but 1,000 by 1,000 took 5.08 seconds. That includes the time it took to generate a million random numbers.
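To make the matrix step concrete, here is a minimal pure-Python sketch of a regression via the normal equations, (X'X) beta = X'y. It solves the linear system by elimination rather than forming the inverse explicitly, which is the same order of work. The data and the function names are invented for illustration; this is not the code behind the timings quoted above.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(xrows, y):
    """Least-squares coefficients: solve (X'X) beta = X'y."""
    k = len(xrows[0])
    xtx = [[sum(row[i] * row[j] for row in xrows) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(xrows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated from y = 1 + 2x; the column of ones is the
# intercept.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
xrows = [[1.0, x] for x in xs]
y = [1.0 + 2.0 * x for x in xs]
beta = ols(xrows, y)
print(beta)   # approximately [1.0, 2.0]
```

The matrix being solved is k by k, where k is the number of regressors, no matter how many rows of data there are; the elimination step costs on the order of k cubed operations, which is why ten variables by hand is a multi-day job.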

So revolution number one, when computers first came out, was a shift from simple correlations and analysis of variance and covariance to linear regression. This was the dominant paradigm from when computers became common until a few years ago.

The second revolution was when computing power became adequate to do searches for optima. Say that you have a simple function to take in inputs and produce an output therefrom. Given your budget for inputs, what mix of inputs maximizes the output? If you have the function in a form that you can solve algebraically, then it's easy, but let us say that it is somehow too complex to solve via Lagrange multipliers or what-have-you, and you need to search for the optimal mix.

You've just walked in on one of the great unsolved problems of modern computing. All your computer can do is sample values from the function--if I try these inputs, then I'll get this output--and if it takes a long time to evaluate one of these samples, then the computer will want to use as few samples as possible. So what is the method of sampling that will find the optimum in as few samples as possible? There are many methods to choose from, and selecting the best depends on enough factors that we call it an art more than a science.
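One of the many methods, sketched here for a single input, is golden-section search. It assumes the function has a single peak on the interval and reuses one old sample per step, so each step costs only one new evaluation. The production function below is hypothetical, and I count evaluations only to make the "as few samples as possible" concern visible.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a single-peaked f on [lo, hi]; return (argmax, evaluations)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    evals = 2
    while b - a > tol:
        if fc > fd:            # the peak lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                  # the peak lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
        evals += 1
    return (a + b) / 2, evals

def output(x):
    """Hypothetical production function with its peak at x = 3."""
    return -(x - 3.0) ** 2 + 5.0

best, evals = golden_section_max(output, 0.0, 10.0)
print(best, evals)
```

This works because the problem is easy: one dimension, one peak. With many inputs, several local peaks, or a function that takes minutes per sample, the choice of method becomes the art alluded to above.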

In the statistical context, the paradigm is to look at the set of input parameters that will maximize the likelihood of the observed outcome. To do this, you need to check the likelihood of every observation, given your chosen parameters. For a linear regression, the dimension of your task was equal to the number of regression parameters, maybe five or ten; for a maximum likelihood calculation, the dimension is related to the number of data points, maybe a thousand or a million. Executive summary: the problem of searching for a likelihood function's optimum is significantly more computationally intensive than running a linear regression.
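A minimal sketch of the maximum likelihood setup, assuming an Exponential model chosen only because its answer is known in closed form (one over the sample mean), so the search can be checked. The data are made up and the search routine is a plain ternary search, not a recommendation. Notice that every trial value of the parameter requires a pass over all of the data.

```python
import math

def exp_loglik(rate, data):
    """Log-likelihood of an Exponential(rate) model; touches every data point."""
    return sum(math.log(rate) - rate * x for x in data)

def maximize(f, lo, hi, steps=200):
    """Ternary search for the peak of a single-peaked function on [lo, hi]."""
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical waiting times with a sample mean of exactly 1.0.
data = [0.5, 1.2, 0.3, 2.0, 1.0]
rate_hat = maximize(lambda r: exp_loglik(r, data), 0.01, 10.0)
print(rate_hat, 1 / (sum(data) / len(data)))   # the two should agree closely
```

Here there are five data points and four hundred likelihood evaluations; scale the data up to a million points and the parameter up to several dimensions and the cost relative to a regression becomes plain.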

So it is no surprise that in the last twenty years, we've seen the emergence of statistical models built on the process of finding an optimum for some complex function. Most of the stuff below is a variant on the search-the-space method. But why is the most likely parameter favored over all others? There's the Cramer-Rao Lower Bound and the Neyman-Pearson Lemma, but in the end it's just arbitrary. Gauss had no theorems that this framework gives superior models relative to linear projection, but it does make better use of computing technology.

Hemlines
The second thread of statistical fashion is whim-driven like any other sort of fashion. Golly, the population collectively thinks, everybody wore hideously bright clothing for so long that it'd be nice to have some understated tones for a change. Or: now that music engineers all have ProTools, everything is a wall of sound; it'd be great to just hear a guy with a guitar for a while. Then, a few years later, we collectively agree that we need more fun colors and big bands. Repeat the cycle until civilization ends.

Statistical modeling sees the same cycles, and the fluctuation here is between the parsimony of having models that have few moving parts and the descriptiveness of models that throw in parameters describing the kitchen sink. In the past, parsimony won out on statistical models because we had the technological constraint.

If you pick up a stats textbook from the 1950s, you'll see a huge number of methods for dissecting covariance. The modern textbook will have a few pages describing a Standard ANOVA (analysis of variance) Table, as if there's only one. This is a full cycle from simplicity to complexity and back again. Everybody was just too overwhelmed by all those methods, and lost interest in them when linear regression became cheap.

Along the linear projection thread, there's a new method introduced every year to handle another variant of the standard model. E.g., last season, all the cool kids were using the Arellano-Bond method on their time series so they could assume away endogeneity problems. The list of variants and tricks has filled many volumes. If somebody used every applicable trick on a data set, the final work would be supremely accurate--and a terrible model. The list of tricks balloons, while the list of tricks used remains small or constant. Maximum likelihood tricks are still legion, but I expect that the working list will soon find itself pared down to a small set as optimum finding becomes standardized.

In the search-for-optima world, the latest trend has been in `non-parametric' models. First, there has never been a term that deserved air-quotes more than this. A `non-parametric' model searches for a probability density that describes a data set. The set of densities is of infinite dimension. If all you've got is a hundred data points, you ain't gonna find a unique element of that set with that. So instead, you specify a certain subset of densities, like sums of Normal distributions, and then search that subset for a density that leads to a nice fit to the data. You'll wind up with a set of what we call parameters that describe that derived distribution, such as the weights, means, and variances of the Normal distributions being summed.

But `non-parametric' models allow you to have an arbitrary number of parameters. Your best fit to a 100-point data set is a sum of 100 Normal distributions. If you fit 100 points with 100 parameters, everybody would laugh at you, but it's possible. In that respect, the `non-parametric' setup falls on the descriptive end of the descriptive-parsimonious scale. In my opinion.
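For the curious, the one-Normal-per-data-point fit looks like this in Python. It is a sketch under loud assumptions: equal weights, a common width picked out of the air, and five made-up observations. Every data point contributes a mean, so the number of parameters grows with the data, which is the point being made above.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_density(x, data, sd=0.5):
    """Equal-weight sum of Normals, one centered on each data point.
    The common width `sd` is an arbitrary choice made for illustration."""
    return sum(normal_pdf(x, point, sd) for point in data) / len(data)

data = [1.0, 1.3, 2.1, 4.0, 4.2]   # hypothetical observations

# Crude Riemann-sum check that the result is a proper density: the area
# under it over [-5, 10) should come out very close to one.
step = 0.01
area = sum(mixture_density(-5 + i * step, data) * step for i in range(1500))
print(area)
```

Plot mixture_density over a grid and you get a two-humped curve hugging the data; shrink sd and it hugs the data more tightly still, with nothing in the method itself telling you when to stop.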

I don't want to sound mean about `non-parametric' methods, by the way. It's entirely valid to want to closely fit data, and I have used the method myself. But I really think the name is false advertising. How about distribution-fitting methods or optimal distribution estimation?

Bayesian methods are increasingly cool. There are the computational problems, that if you want to assume something more interesting than Normal priors and likelihoods, then you need a computer. Those have been surmounted, leaving us with the philosophy issues. In the context here, those boil down to parsimony. Your posterior distribution may be even weirder than a multi-humped sum of Normals, and the only way to describe it may just be to draw the darn graph. Thus, Bayesian methods are also a shift to the description-over-parsimony side.

Method of Moments estimators have also been hip lately. I frankly don't know where that's going, because I don't know them very well.

Also, this guy really wants multilevel modeling to be the Next Big Thing in the linear model world, and makes a decent argument for that.

You can see that the increasing computational ability invites shifting away from parsimony. Since PCs really hit the world of day-to-day stats recently, we're in the midst of a swing toward description. We can expect an eventual downtick toward simpler models, which will be helped by the people who write stats packages--as opposed to the researchers who caused the drift toward complexity--because they write simple routines that implement these methods in the simplest way possible.

So is your stats textbook obsolete? It's probably less obsolete than people will make it out to be. The basics of probability have not moved since the Central Limit Theorems were solidified. In the end, once you've picked your paradigm, there aren't really many methods out there for truly and honestly cutting corners; most novelties are just about doing detailed work regarding a certain type of data or set of assumptions. Further, those linear projection methods or correlation tables work pretty well for a lot of purposes.

But the fashionable models that are getting buzz shift every year, and last year's model is often considered to be naïve or too parsimonious or too cluttered or otherwise an indication that the author is not down with the cool kids--and this can affect peer review outcomes. A textbook that focuses on the sort of details that were pressing ten years ago, instead of just summarizing them in a few pages, will have to pass up on the detailed tricks the cool kids are coming up with this season--which will in turn affect peer reviews for papers written based on the textbook's advice. All this is entirely frustrating, because we like to think that our science is searching for some sort of true reflection of constant reality, yet the methods that are acceptable for seeking out constant reality depend a bit more on human whim than I'd really like.





on Monday, September 11th, Andy said

Interesting idea that methods as well as theories can go through paradigm shifts. But how do you know that this doesn't really represent progress? Regressions are more powerful than (i.e. a superset of) AN(C)OVA, and we ain't going back to those old days. So there is a competitive process of creative destruction, yadda yadda, until the best stats win. For example, take Huber-White robust standard errors. Or Dickens/Moulton style clustered standard errors. Nowadays people just use them without making a big deal about it, because they work. Maybe A-B will be in that same category ten years from now. Really, one problem I always have is figuring out how much the reader knows already and how much I should spell out.

on Monday, September 11th, Miss ALS of San Diego said

I think the truly interesting thing about shifts in methods is that until they are considered 'street wear' and not just haute couture, those using the latest thing have to examine (and, gasp, explain) their assumptions. People forget all about the requirements for OLS all_the_frickin_time...but people see OLS and they know how to interpret the results, so they don't bother figuring out whether OLS is appropriate. If you're using a Bayesian technique (which is also horribly named, i.m.o), you've got to convince people that your priors are reasonable, you've got to have a deeper understanding of the method because uber-human friendly programs like stata won't just chug it out for you.



26 October 06. Is Ruby halal?


The starting point here is last episode's essay on programming languages, and this here is basically an explanation and generalization of why I wrote it. For those who didn't read it (and I don't blame ya), here's a summary in the form of a description of my ideal girlfriend: she should be an Asian Jewess, around 172-174cm tall, gothy, sporty, significantly smarter than me, significantly cuter than me, significantly better socialized than me, willing to hang out with me, very well organized but endlessly spontaneous, enjoys walks along the beach, does intellectually challenging work that involves being outdoors, and plays guitar in a rock band. Yeah.

So: too bad half of those things contradict the other half, eh.

The first key difference between the problem of picking a programming language and the problem of picking a significant other is that the programming language doesn't have to like you back. The liking-you-back issue creates many volumes' worth of interesting stories, all of which I will ignore here, in favor of the other key difference: unlike many girl/boyfriends, programs are often shared among friends and coworkers, meaning that there are externalities in my arbitrary, personal-preference choice.

Personal preference plus externalities is the perfect recipe for never-ending, repetitive debate.

Debating the undebatable
Under Jewish law, one must never say the Name of God. In fact, there is none--it's sort of a mythical incantation, used to breathe life into Golems and otherwise tell monotheistic fairy tales. Under Islamic law, one must speak the Name of God when slaughtering an animal for the animal's flesh to be halal. My reading here is that there is therefore no way for meat to be both halal and kosher.

And let's note, by the way, that kosher and halal laws are not cast as rules about keeping clean for the sake of disease prevention. They're ethical laws, meaning that, like personal preference, they can't really be debated. It's not like somebody will finally find the correct answer and write it down for everybody to see. We can't even agree to basic axioms like `you should be nice to people' or `don't be wasteful'.

Do ethical laws induce externality problems? From the looks of it, yes they do, because so many people spend so much time trying to get other people to conform to their personal ethics. Ethics are an extreme form of that other personal preference, æsthetics, and seeing somebody commit what you consider to be an unethical act is often on par with watching somebody wearing a floppy brown sweater with spandex safety orange tights.

Fortunately, almost everybody understands that there is no point going up to Mr. Brown-and-orange and telling him he needs to change, because we all know exactly how the conversation will go: some variant of `I have my own personal preferences' or `who are you to impose your arbitrary choices upon me'. That is, it would be a boring argument, because there is fundamentally no right answer.

When does human life begin? I have no idea, and anybody who says otherwise is guilty of hubris.

Gee, that was a fun debate, wasn't it.

And the problem with that non-debate, as with this essay, is that it has no emotionally satisfying conclusion. The natural form of a debate is for one side to present its best arguments, the other side to present its own, and then both sides go home and think about it. But the form of debate that is emotionally satisfying has a resounding conclusion, where one side tearfully confesses to the other, `OK, I was wrong!' But with arguments of ethics or personal preference, this sort of resolution happens about once every never.

But there's a simple way to fix this problem: invent statistics.

After all, not all debates are mere issues of personal preference. A question like `will building this road or starting this war improve the economy' has a definite answer, though we're typically not smart enough to know it. There are valid grounds for debate there.

But for ethics and personal preference issues, we can still make it look like there are valid grounds for debate. Find out whether abortions decrease crime [the paper that claims this, by Steven “Freakonomics” Levitt and another not-famous economist, has been shown to be based on erroneous calculations (PDF)], find out whether people commit more errors when commas are used as separators or terminators, run benchmarks, accuse the author of the file system you don't like of being a murderer. With enough haphazard facts, any debate about pure personal preference regarding simple trade-offs can be extended to years of tedium.

This turns debates that should be of the natural form (both sides state opinions, then go home) into the resounding form of debate, where both sides attempt to get the other side to tearfully confess the errors of its ways. But the sheen of facts doesn't change the fundamental nature of debates over ethics or personal preference, and because these are debates where nobody is actually wrong, nobody will ever be convinced to bring about an emotionally satisfying conclusion. We instead simply have a new variant on the recipe for tedious, never-ending debate.

Relevant previous entries:
The one three years ago when I advocated C, and came off sounding very reasonable, I think.
The one where I complain about an especially vehement set of proselytizers.
The one where I talk about the value of stable standards.




[link][2 comments]

on Sunday, October 29th, rd said

-i think Levitt claims the mistakes weren't significant and don't alter his main conclusions

-your claim that ethical debates are fundamentally unresolvable might also be a matter of opinion. Eg, some people might think that human life starts at x and this can be proven axiomatically, we just haven't figured out how to do it yet. it's possible (but highly unlikely?) that at some pt 'somebody will finally find the correct answer and write it down for everybody to see', no?

on Monday, October 30th, the author said

Yes, Donohue and Levitt have a response (PDF) to the claims. I have not had time to really look at the `metrics on any of the papers involved; if anybody has, let me know. But the key allegation is that the original abortion-prevents-crime paper used ln(arrests) as the dependent variable, and if you redo the regression with the much more sensible ln(arrests per capita) and jiggle the variables a bit, then the effect disappears. In the response linked above, Donohue and Levitt respond that if you do use ln(arrests per capita), and jiggle the variables more, then the effect reappears.

In short, we have a specification fight. I think we should throw out specifications of the regression with ln(arrests). The authors of the critique, Foote and Goetz, found a valid specification where abortions have no relation to crime, and then Donohue and Levitt found another specification where abortions reduce crime. For my part, all I can do is modestly and respectfully comment that I told you so.
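
For anybody wondering how swapping ln(arrests) for ln(arrests per capita) could make an effect vanish, here is a minimal Python sketch. Everything in it is made up for illustration: the numbers are not from Donohue and Levitt or from Foote and Goetz, and the names `exposure`, `population`, `rate`, and `ols_slope` are my own hypothetical choices. It assumes a toy world where the exposure variable shrinks the cohort but leaves the per-person arrest rate untouched.

```python
import math


def ols_slope(x, y):
    """Slope from a least-squares fit of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Synthetic, hypothetical inputs: NOT data from any of the papers discussed.
exposure = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
# Assumption: cohort size falls as exposure rises.
population = [1000.0 * math.exp(-0.5 * e) for e in exposure]
# Assumption: arrests per person do not depend on exposure at all.
rate = 0.02
arrests = [rate * p for p in population]

ln_arrests = [math.log(a) for a in arrests]
ln_population = [math.log(p) for p in population]
ln_per_capita = [math.log(a / p) for a, p in zip(arrests, population)]

slope_total = ols_slope(exposure, ln_arrests)
slope_population = ols_slope(exposure, ln_population)
slope_per_capita = ols_slope(exposure, ln_per_capita)

print("slope with ln(arrests):           ", slope_total)
print("slope with ln(population):        ", slope_population)
print("slope with ln(arrests per capita):", slope_per_capita)
```

Because ln(arrests per capita) = ln(arrests) - ln(population), the per capita slope is just the ln(arrests) slope minus the ln(population) slope. In this toy setup the ln(arrests) regression reports a slope of about -0.5 even though nobody's behavior changed, and the per capita regression reports about zero. Whether anything like this drives the real dispute is a separate question that this sketch does not answer.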
