A time series analysis of Amazon sales rank
10 August 06.
I have been very interested in the sales of Math You Can't Use: Patents, Copyright, and Software, a book with which I was heavily involved. (Amazon page) So naturally, I've been tracking the Amazon sales rank. At first, I did it the way everybody else does--refreshing the darn page every twenty minutes--but I have recently started doing it the civilized way--an automated script. Here is what I've learned about how Amazon does its rankings.
Background and conclusion
First, to give you some intuition as to sales rank, here's a little table:
How much more detail can we get? The answer: none, really. You'll see below that over the course of a few days, the ranking of a typical book will go from 50,000 to 500,000, and a minute later it will be back at 50,000. Thus, the sort of things we usually do with a ranking, like compare two books, are unstable to the point of uselessness. One thing you evidently can do with the ranking is determine whether a book has sold a copy in the last hour or two. As you'll see below, there's a simple formula that will work for most books: if the current rank is better than the earlier rank (that is, the number is smaller), then there was a recent sale.
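As a minimal sketch of that rule, assume the readings are a time-ordered list of integer ranks in which a smaller number is a better rank. The function name, the min_gain threshold, and the sample readings are all made up for illustration; none of them come from the tracking script further down.

```python
def sale_indices(ranks, min_gain=1):
    """Return the positions where the rank improved on the previous reading.

    Ranks are Amazon-style: 1 is best, so an improvement is a numeric drop.
    min_gain is the smallest drop that counts as a sale.
    """
    return [i for i in range(1, len(ranks))
            if ranks[i - 1] - ranks[i] >= min_gain]


# Hypothetical readings, one every three hours.
readings = [210000, 215500, 221000, 48000, 52000, 57500]
print(sale_indices(readings))
```

Raising min_gain above the default would be one way to ignore the small ten or twenty point rises that seem to hit many books at once.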
The first chart
Here is a plot of sales of Math You Can't Use. A data point is added every three hours, so if you come back in a week, this plot will be different. See below for the code I used to generate this. On the x-axis is the date, and on the y-axis is the Amazon sales rank at that time.
You can see from the plot that the pattern is a sudden jump and then a slow drift downward. The clearest explanation is that the sales rank is basically a function of last sale. When a copy sells, the book jumps to a high rank, and then gets knocked down one unit every time any lower-ranked book sells. There are lots of details that those of us not working at Amazon will never quite catch. There are periods (sometimes mid-day) when the rank drifts down more slowly than it should, then speeds up in its descent. This implies to me some computational approximations that eventually get corrected. You'll notice that some of the books below show a small slope upward (a ten or twenty point rise in ranking) from time to time. When this happens, lots of books do it at once, also indicating some sort of correction whose purpose or method I don't have enough information to divine. Epstein and Axtell's book rises appreciably when it nears half a million. Finally, I don't have enough data to determine whether the ranking distinguishes between sales of used and new copies; I don't think it does.
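The jump-then-drift pattern can be reproduced with a toy model in which rank depends on nothing but the time of the last sale. This is a deliberately oversimplified sketch, not Amazon's method: a sale here sends a book straight to rank 1, which real sales clearly do not, and the five books and the order of sales are invented.

```python
def rank_of(book, last_sale):
    """Rank 1 goes to the most recently sold book, rank 2 to the next, and so on."""
    order = sorted(last_sale, key=last_sale.get, reverse=True)
    return order.index(book) + 1


# Five hypothetical books; each value is the time of that book's last sale.
last_sale = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

history = []
# A made-up sequence of sales, one per time step, starting at t = 6.
for t, sold in enumerate(["E", "D", "D", "A", "E"], start=6):
    last_sale[sold] = t
    history.append(rank_of("E", last_sale))

print(history)  # rank of book E after each sale
```

Book E jumps to the top when it sells, slips one place each time a book ranked below it sells, and holds its place when a book already above it sells again.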
Others
Here is a haphazard sampling of other books. Again, these are dynamically regenerated every three hours, so come back later for more action-packed graphing. Update 25 June 2008: I've switched to a host that doesn't have gnuplot, so the plots are no longer updated; you've got two years of data, and that's it. Some of these books bear something in common with Math You Can't Use, and others were based on a trip to the used book store I'd made the other day. Some have hardcover and paperback editions, in which case I just plot the paperback.
Epstein and Axtell's Growing Artificial Societies [Amazon p.]
Andy Rathbone, Tivo for Dummies. I have no idea who would buy this, and yet it is the best nonfiction seller here. This proves that I must never go into marketing. [Amazon p.]
Dickens's Great Expectations, Penguin Classics ed. [Amazon p.] Books in the top 10,000 or so are selling several copies a day, so the pattern looks different.
Madonna's Sex. [Amazon p.] Somebody ran into the used bookstore asking for a copy, and ran out when the owner said he didn't have one. It's amusing that a book from 1992 could still instill such fervor in a person. It sells new for $125, used around $85.
Ian McEwan's Atonement. [Amazon p.] I really thought I'd hate this book, since it starts off as being about subtle errors in manners committed by a gathering of relatives and friends at a British country manor, but it turned out to be an interesting modern take on the genre. Update: After I read it, it turned into a movie; you can see in the plot when it was in theaters.
Executive summary
At this point, I'm not sure why Amazon ranks books below the top thousand, except for a sort of geek factor. For all of the books here, it is basically impossible to say something like `Tivo for Dummies is ranked around 100,000,' since the ranking jumps by an order of magnitude almost daily. Similarly, there's no point saying `Atonement is ranked higher than Great Expectations', since you have a 50-50 chance of being wrong tomorrow. All we get is a very broad ballpark figure (a football field figure?), and a too-good impression of how many hours ago the last sale was made.
Those of us interested in the sales rank of books outside Oprah's picks would be better served if the system were less volatile. In technical terms, if my guess that the score experiences exponential decay is correct, then the ranking system would be more useful to those of us watching the long tail if the decay were slower, so that a score loses a smaller fraction of its value each period.
Technical notes
The data looks to me like an exponential decay system, where you have a current score S_t which goes up by some amount every sale, but drifts down by some discount rate every period, S_{t+1} = λ S_t with 0 < λ < 1. [Thus, if there were no sales events, your score would be S_t = S_0 λ^t = S_0 exp(t ln λ), which shrinks over time because ln λ is negative.]
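One way to get a feel for the multiplier λ is to ask how many periods a score takes to halve when there are no sales. The sketch below only restates the recurrence; the function name is made up, 0.96 is the fitted value reported further down, the other two values are arbitrary comparison points, and the length of a period is whatever the model is fit to.

```python
import math


def half_life(lam):
    """Periods needed for a score to halve under S_{t+1} = lam * S_t, with no sales."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(lam)


for lam in (0.90, 0.96, 0.99):
    print(lam, round(half_life(lam), 1))
```

A multiplier closer to one gives a longer half-life, which is the slower, less volatile decay argued for in the executive summary.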
To fit this, I flipped and renormalized the rankings so that one was the highest possible ranking, and zero corresponded to a ranking of 500,000. Then, I set the following algorithm:
As you can imagine, I found those constants via minimizing the distance between the estimate and the actual. The algorithm is an exponential decay model with λ = 0.96, and upward shocks as described. The only way I could fit the data was to make shocks when the book is at a low sales rank bigger than shocks when it has a high sales rank. There's surely a more clever way to do it. The green line shows the exponential decay model fit to the actual data. You can decide if this is a good fit or a lousy one.
You can also have a look at how the model fits Madonna's book.
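The fitted constants are not reproduced here, so the following is only a sketch of the general shape of such a model, under loud assumptions: the linear rescaling in normalize, the shock rule in placeholder_shock, and every name are invented for illustration. Only λ = 0.96, the 500,000 floor, and the idea that shocks are bigger at low sales ranks come from the text.

```python
def normalize(rank, worst=500000):
    """Map rank 1 to a score of 1.0 and rank `worst` to 0.0 (linear: an assumption)."""
    return 1.0 - (rank - 1) / float(worst - 1)


def placeholder_shock(score):
    """Made-up shock size: bigger when the score is low, smaller near the top."""
    return 0.5 * (1.0 - score)


def simulate(s0, sale_periods, n_periods, lam=0.96, shock=placeholder_shock):
    """Decay the score by lam each period; add a shock in periods with a sale."""
    scores = [s0]
    for t in range(1, n_periods + 1):
        s = lam * scores[-1]
        if t in sale_periods:
            s = min(1.0, s + shock(s))
        scores.append(s)
    return scores


def squared_error(estimate, actual):
    """Sum of squared gaps between the model and the observed scores."""
    return sum((e - a) ** 2 for e, a in zip(estimate, actual))


# Hypothetical run: start at rank 100,000 with sales in periods 5 and 20.
path = simulate(normalize(100000), {5, 20}, 30)
print([round(s, 3) for s in path[:8]])
```

Fitting would then mean searching over lam and the shock rule for the values that make squared_error smallest against the normalized rank history.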
The code
For the geeks with their own stuff to track, here's my code. You can see that it's Python glue holding together a number of tools that I take to be common POSIX tools, like wget and a copy of Gnuplot recent enough to render the PNG format. The main loop (while (1)) just runs the checkrank function, writes the time and rank to a file, and then calls Gnuplot to plot the file. The checkrank function downloads the book's page, searches for the phrase `Amazon.com Sales Rank: ###' and returns the ### part.
Usage:
import amazon #assuming you've saved the file below as amazon.py.
amazon.onebookloop("03030303", "output_graph.png")
Notice that the ASIN should be a text string: if it were a number, an initial zero would often be dropped, and some ASINs have Xes at the end.
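A quick check of why a number will not do, using made-up ASINs and a hypothetical helper name:

```python
def survives_as_number(asin):
    """True only if turning the ASIN into an int and back returns the same text."""
    return asin.isdigit() and str(int(asin)) == asin


# All three ASINs are invented for illustration.
for asin in ("0303030303", "030303030X", "1234567890"):
    print(asin, survives_as_number(asin))
```

The first loses its leading zero, the second cannot be converted at all, and only the third makes the round trip.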
This is pretty rudimentary; in the spirit of open source, I'd be happy to post your improvements.
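#!/usr/bin/python
#(c)2006 Eric Blair. Licensed under the GNU GPL v 2.
import re, os, subprocess, time

def checkrank(asin):
    """Download the book's page with wget; return the rank text, or None."""
    # os.popen3 no longer exists in current Python, so wget is run
    # through subprocess and its standard output is read as bytes.
    page = subprocess.Popen(
        ["wget", "http://amazon.com/gp/product/%s/" % (asin,), "-O", "-"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()[0]
    exp = re.compile(br'Amazon\.com Sales Rank:</b> #([^ ]*) in Books')
    result = exp.search(page)
    if result is not None:
        return result.group(1).decode('ascii')
    else:
        return None

def onebookloop(asin, outfile):
    datafile = "rankings.%s" % (asin,)
    # The Gnuplot commands go at the top of the data file: the whole file
    # is piped to gnuplot, and plot '-' reads the rows that follow it.
    if not os.path.isfile(datafile):
        f = open(datafile, 'w')
        f.write("""set term png
set xdata time
set timefmt "%Y; %m; %d; %H;"
set yrange [1:*] reverse
plot '-' using 1:5 title "sales rank"
""")
        f.close()
    while (1):
        f = open(datafile, 'a')
        t = time.localtime()
        r = None
        while r is None:  # retry every ten seconds until the page parses
            r = checkrank(asin)
            if r is None:
                time.sleep(10)
        f.write("%i; %i; %i; %i; %s\n" % (t.tm_year, t.tm_mon, t.tm_mday, t.tm_hour, r))
        f.close()
        # Strip the commas from the ranks, then plot; os.system waits
        # for gnuplot to finish before the loop goes to sleep.
        os.system("sed -e 's/,//g' < %s | gnuplot > %s" % (datafile, outfile))
        time.sleep(3*60*60) #3 hours.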