Patterns in static

A time series analysis of Amazon sales rank


10 August 06.


I have been very interested in the sales of Math You Can't Use: Patents, Copyright, and Software, a book with which I was heavily involved. (Amazon page)

So naturally, I've been tracking the Amazon sales rank. At first, I did it the way everybody else does--refreshing the darn page every twenty minutes--but I have recently started doing it the civilized way--an automated script. Here is what I've learned about how Amazon does its rankings.

Background and conclusion
First, to give you some intuition as to sales rank, here's a little table:

1-10: Oprah's latest picks
10-100: the NYT's picks
100-1,000: books by editors of Wired magazine, topical rants by pundits/journalists
1,000-500,000: everything else (still selling)
500,000-2 million: everything else (technically in stock)

How much more detail can we get? The answer: none, really. You'll see below that over the course of a few days, the ranking of a typical book will go from 50,000 to 500,000, and a minute later it will be back at 50,000. Thus, the sort of things we usually do with a ranking, like compare two books, are unstable to the point of uselessness.

One thing you evidently can do with the ranking is determine whether a book has sold a copy in the last hour or two. As you'll see below, there's a simple rule that will work for most books: if the current rank is a better (numerically smaller) rank than the earlier one, then there was a recent sale; otherwise the book has just been drifting.
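In code, the detector is a single comparison. The function name is mine, and "rank" is the number Amazon reports, where smaller means better:

```python
def sold_recently(earlier_rank, current_rank):
	# Amazon ranks: a smaller number is a better rank. Between two checks,
	# the rank number only gets smaller when a copy sells; otherwise it
	# drifts toward bigger numbers as other books sell.
	return current_rank < earlier_rank

sold_recently(80000, 52000)   # True: the rank improved, so a copy sold
sold_recently(52000, 80000)   # False: just the ordinary downward drift
```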

The first chart
Here is a plot of sales of Math You Can't Use. A data point is added every three hours, so if you come back in a week, this plot will be different. See below for the code I used to generate this. On the x-axis is the date, and on the y-axis is the Amazon sales rank at that time.

[Figure: Sales rank for Klemens's Math You Can't Use]

You can see from the plot that the pattern is a sudden jump and then a slow drift downward. The clearest explanation is that the sales rank is basically a function of last sale. When a copy sells, the book jumps to a high rank, and then gets knocked down one unit every time any lower-ranked book sells.

There are lots of details that those of us not working at Amazon will never quite catch. There are periods (sometimes mid-day) when the rank drifts down more slowly than it should, then speeds up in its descent. This implies to me some computational approximations that eventually get corrected. You'll notice that some of the books below show a small slope upward (a ten or twenty point rise in ranking) from time to time. When this happens, lots of books do it at once, also indicating some sort of correction whose purpose or method I don't have enough information to divine. Epstein and Axtell's book rises appreciably when it nears half a million. Finally, I don't have enough data to determine whether the ranking distinguishes between sales of used and new copies; I don't think it does.


Here is a haphazard sampling of other books. Again, these are dynamically regenerated every three hours, so come back later for more action-packed graphing. Update 25 June 2008: I've switched to a host that doesn't have gnuplot, so the plots are no longer updated; you've got two years of data, and that's it. Some of these books have something in common with Math You Can't Use, and others come from a trip to the used book store I'd made the other day. Some have hardcover and paperback editions, in which case I just plot the paperback.

Epstein and Axtell's Growing Artificial Societies [Amazon p.]

[Figure: Sales rank for Epstein and Axtell's Growing Artificial Societies]

Andy Rathbone, Tivo for Dummies. I have no idea who would buy this, and yet it is the best nonfiction seller here. This proves that I must never go into marketing. [Amazon p.]

[Figure: Sales rank for Rathbone's Tivo for Dummies]

Dickens's Great Expectations, Penguin Classics ed. [Amazon p.] Books in the top 10,000 or so are selling several copies a day, so the pattern looks different.

[Figure: Sales rank for Great Expectations]

Madonna's Sex. [Amazon p.] Somebody ran into the used bookstore asking for a copy, and ran out when the owner said he didn't have one. It's amusing that a book from 1992 could still instill such fervor in a person. It sells new for $125, used around $85.

[Figure: Sales rank for Madonna's Sex]

Ian McEwan's Atonement. [Amazon p.] I really thought I'd hate this book, since it starts off as being about subtle errors in manners committed by a gathering of relatives and friends at a British country manor, but it turned out to be an interesting modern take on the genre. Update: After I read it, it turned into a movie; you can see in the plot when it was in theaters.

[Figure: Sales rank for McEwan's Atonement]

Executive summary
At this point, I'm not sure why Amazon ranks books below the top thousand, except for a sort of geek factor. For all of the books here, it is basically impossible to say something like `Tivo for Dummies is ranked around 100,000,' since the ranking jumps by an order of magnitude almost daily. Similarly, there's no point saying `Atonement is ranked higher than Great Expectations', since you have a 50-50 chance of being wrong tomorrow. All we get is a very broad ballpark figure (a football field figure?), and a too-good impression of how many hours ago the last sale was made.

Those of us interested in the sales rank of books outside Oprah's picks would be better served if the system were less volatile. In technical terms, if my guess that the score experiences exponential decay is correct, then the ranking system would be more useful to those of us watching the long tail if the decay factor were set to a smaller value.

Technical notes

The data looks to me like an exponential decay system: there is a current score S_t, which jumps up by some amount at every sale but shrinks by a discount factor λ < 1 every period, S_{t+1} = λ S_t. [Thus, if there were no sales events, the score would decay as S_t = λ^t S_0 = S_0 exp(t ln λ), and since ln λ is negative this is exponential decay.]

To fit this, I flipped and renormalized the rankings so that one was the highest possible ranking, and zero corresponded to a ranking of 500,000. Then I used the following algorithm:
The score was initialized at 0.58.
Each period, the score is multiplied by a factor of 0.96 (i.e., it shrinks).
If there is a sale, the score rises by (1 - current score) * 0.79.
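A minimal simulation of those three steps, with the fitted constants as defaults (the function name and the sale schedule in the example are mine, for illustration only):

```python
def simulate(periods, sale_times, s0=0.58, decay=0.96, shock=0.79):
	"""Score on the flipped scale: 1 is rank #1, 0 is rank 500,000."""
	scores = []
	s = s0
	for t in range(periods):
		s *= decay                # every period, the score shrinks
		if t in sale_times:
			s += (1 - s) * shock  # a sale pushes the score toward 1
		scores.append(s)
	return scores

# Ten periods with one (made-up) sale at t=4: a slow slide, one big jump.
path = simulate(10, sale_times={4})
```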

As you can imagine, I found those constants by minimizing the distance between the estimate and the actual data. The algorithm is an exponential decay model with λ = 0.96, plus upward shocks as described. The only way I could fit the data was to make the shocks bigger when the book's score is low (a poor rank) than when its score is high; the (1 - current score) factor does exactly that. There's surely a more clever way to do it.
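I don't know how the original fit was coded; a crude sketch of one way to do it is a grid search over (decay, shock), minimizing squared error against an observed score series (all names here are mine, and `simulate_scores` reimplements the three-step model above):

```python
def simulate_scores(periods, sale_times, s0, decay, shock):
	# Same decay-plus-shock model as in the text.
	s, out = s0, []
	for t in range(periods):
		s *= decay
		if t in sale_times:
			s += (1 - s) * shock
		out.append(s)
	return out

def sum_sq_error(observed, sale_times, s0, decay, shock):
	est = simulate_scores(len(observed), sale_times, s0, decay, shock)
	return sum((e - o) ** 2 for e, o in zip(est, observed))

def grid_fit(observed, sale_times, s0=0.58):
	"""Return (error, decay, shock) minimizing squared error on a coarse grid."""
	best = None
	for decay in [x / 100.0 for x in range(90, 100)]:       # 0.90 .. 0.99
		for shock in [x / 100.0 for x in range(50, 100, 5)]:  # 0.50 .. 0.95
			e = sum_sq_error(observed, sale_times, s0, decay, shock)
			if best is None or e < best[0]:
				best = (e, decay, shock)
	return best
```

Given a real rank series, you would first flip and renormalize it to the [0, 1] score scale, then pass it in as `observed`.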

The green line shows the exponential decay model fit to the actual data. You can decide if this is a good fit or a lousy one.

[Figure: My attempts to fit the Amazon sales rank to an exponential model. The fitted line curves more than the real data, and its jumps are typically not as high.]

You can also have a look at how the model fit to Madonna's book.

The code
For the geeks with their own stuff to track, here's my code. You can see that it's Python glue holding together a couple of tools that I take to be common POSIX tools: wget, and a copy of Gnuplot recent enough to render the PNG format. The main loop (while 1) just runs the checkrank function, writes the time and rank to a file, and then calls Gnuplot to plot the file. The checkrank function downloads the book's page, searches for the phrase ` Sales Rank: ###', and returns the ### part.


import amazon   # assuming you've saved the file below as amazon.py
amazon.onebookloop("03030303", "output_graph.png")
Notice that the ASIN should be a text string, not a number: if it were a number, an initial zero would often be dropped, and some ASINs end in an X.

This is pretty rudimentary; in the spirit of open source, I'd be happy to post your improvements.

#(c)2006 Eric Blair. Licensed under the GNU GPL v 2.
# NOTE: the wget URL and the gnuplot header file name were mangled when this
# page was archived; both are reconstructions and may need adjusting.
import re, os, time

def checkrank(asin):
	(a, b, wresult) = os.popen3("""
	wget "http://www.amazon.com/o/ASIN/%s" -O -
	""" % (asin,))
	exp = re.compile(r' Sales Rank:</b>  #([^ ]*) in Books')
	result = exp.search(wresult.read())
	if result is not None:
		return result.groups()[0]
	return None

def onebookloop(asin, outfile):
	# Write the gnuplot preamble once; the data gets piped in after it.
	header = "gnuplot_header.%s" % (asin,)
	if not os.path.isfile(header):
		f = open(header, 'w')
		f.write("""set term png
set xdata time
set timefmt "%Y; %m; %d; %H;"
set yrange [1:*] reverse
plot '-' using 1:5 title "sales rank"
""")
		f.close()
	while 1:
		f = open("rankings.%s" % (asin,), 'a')
		t = time.localtime()
		r = None
		while r is None:
			r = checkrank(asin)
			if r is None:
				time.sleep(10*60)	# fetch failed; try again in ten minutes
		f.write("%i; %i; %i; %i; %s\n" % (t.tm_year, t.tm_mon, t.tm_mday, t.tm_hour, r))
		f.close()
		# Strip the commas out of the rank (e.g., 52,000) and replot.
		os.system("""(cat %s; sed -e 's/,//g' < rankings.%s) | gnuplot > %s""" % (header, asin, outfile))
		time.sleep(3*60*60)	# 3 hours


Replies: 9 comments

on Thursday, August 10th, techne said

Very fun. If only I had a book to track.

on Thursday, August 10th, Andy said

You can also use this formula to judge how Amazon's business is doing overall. When the book drifts down at a faster rate, that means that more books underneath it are selling; when it drifts down more slowly, fewer books below it are being sold.

on Thursday, August 10th, Miss ALS of San Diego, of course said

You're a geek.

I know why Tivo for Dummies sells--how funny would that be as a gift Christmas morning with your tivo? Answer? Funny. Very very funny. We had a good laugh at the bookstore as I recall.

on Sunday, August 20th, AC said

Heh. Neat.

on Sunday, March 25th, Mike said

I know why Tivo for Dummies sells -- I work at a call center for Directv, and get calls all the time like: "How do I record a show? How do I erase a show? How do I set it up to record something regularly?" It really makes one sad for the future of society, because more than half of these people have VCRs and can use them well. Just remember... It may be obvious, it may be EXACTLY THE SAME as something they ALREADY USE, but it ISN'T what they already use, so obviously their pre-existing knowledge is worthless. This holds true for web sites and computer programs as well.

on Tuesday, March 17th, jonathan yates said

I have no idea how to use your script I have been trying for days to get the following working but to no avail - any ideas ? Jonathan

import os
import string

com="wget -O -";



print str(fc)

on Tuesday, March 17th, the author said


I'm not much of an expert on the details of Python, to tell you the truth, just decent enough to bang together scripts like this. So I don't think I could help you debug the partial script you put up here. There are many sites that will help you better than I could; e.g., everything I know I learned from Dive into Python.

on Wednesday, August 11th, The Bargain Book Mole said

Impressive looking work for sure- My brain would like to draw some conclusions about sales rank and frequency of sales per month, but I am stuck wrapping my head around your math :)

on Wednesday, January 12th, Le Creuset on Sale said

Sales rank is just a figure to take into consideration. There are many other factors that are just as important (i.e. category, usefulness, etc). Even though one can correlate sales rank to this items, a lot also comes down to feel.
