Patterns in static

R is slow






30 March 06.


I find that people I talk to often don't realize how quickly their computer can do computations. Somebody doing analytic work will complain to me that they are using a nice, lean program to shunt data, but it still takes all night or even all week for their program to run; they presume that they need to buy better hardware, or tweak their code to run that extra 10% faster.

But it is the software that is slowing them down by orders of magnitude.

I had a chance to test the speed of R today. A user requested that I add a Fisher Exact test to Apophenia. Being a smarter programmer than mathematician, I knew to copy and paste the code from elsewhere.

So, I took the code from R. That is, Apophenia, a library of functions for doing stats in C, and R, a stats package, use exactly the same C code for doing this test. There were a few tweaks to the code to de-Rify it, such as replacing R_alloc with plain old malloc, but about 98% of the code is identical.

There is no better chance to test the two. The Fisher Exact test is a computationally-intensive activity, but both R and the C library are using the same algorithm. I wrote an R script and a C program to do a million Fisher Exact tests on a two by two matrix (the code is printed below) and used the shell's time command to see how long they took. I made sure to not print anything to screen, and used the same matrix repeatedly, because generating random matrices would add the cost of random number routines that differ between the two systems.
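
The time command measures a whole run from the outside. For readers who would rather time a loop from inside a C program, here is a minimal sketch using clock() from the standard library. The names time_repeated and sample_work are hypothetical stand-ins, not part of Apophenia or of the scripts used for this post, and sample_work is just a placeholder for whatever test is being repeated.

```c
#include <time.h>

/* Hypothetical workload: a little arithmetic on the same fixed inputs on
   every call, so nothing but the repeated work itself is being measured. */
static volatile double sink;   /* volatile, so the compiler keeps the work */

void sample_work(void){
    double data[] = { 30, 86, 24, 38 };
    double sum = 0;
    for (int i = 0; i < 4; i++)
        sum += data[i];
    sink = sum;
}

/* Call work() reps times and return the CPU seconds the loop consumed. */
double time_repeated(void (*work)(void), long reps){
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        work();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_repeated(sample_work, 1000000) returns the CPU seconds used by a million calls. Note that clock() counts processor time for the process, not time on the wall clock, so it will not match the "real" figure that time reports if the machine is busy with other things.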

Since the Fisher test uses factorials, there is much more number-crunching to be done when the figures are larger. So I tried one run with a two by two matrix with values in the tens, and one with values in the hundreds.

Here are the total times, and their ratios. s=seconds, m=minutes:

Update, 12 January 2007:

This post recently got a lot of attention, and a few points of critique, the most salient being that fisher.test does some additional computationally-intensive work to produce additional statistics after calling the C routine. So in my efforts to run a fair fight, I produced an edited version of fisher.test that returns immediately after calling the C routine; if you repeat these timing tests with the stock version of the function, the R-to-C ratio will more than double.

I've thus deleted the old timings, which put R in a worse light, and replaced them with more comparable numbers. I apologize if that means some pre-edit comments are now out of sync, but I felt this would be fairer than leaving up the old, worse-for-R numbers.

And to address one more hypothesis: no, this isn't about paging. I kept an eye on my hard drive light and it did nothing. I have done other timing tests where R did thrash the hard drive, on a problem where C code with no memory management managed to run just fine, but today's post is about processor speed, not memory use.

C time             2m 49s
R time             88m 47s
R time / C time    31.5

As you can see, R didn't take longer by a few percent--it took about thirty times longer.

When I tell people that their work could be faster if they don't use a stats package, their first response is typically disbelief, thinking that the difference can only be a few percent. But even with a toy example like this, you're seeing a thirty-fold slowdown.

Their second response is to say that oh, they can just write C code to do the computationally intensive parts and call them from their favorite package. But here, I wrote an R script that does nothing but call C code over and over. This is where you'd end up, for example, if you had a simulation run from R with all the math being done in C. Many estimations of global parameters of a model would work like this. There are little data-shunting tasks on the R side surrounding the C function, and I could move those to the C side too, but at that point, why am I using R at all?

Now, R has its place: I'm not going to tell a statistics student or somebody doing exploratory poking around on a medium-sized data set to learn C. The code base is so complete, I often use it as a reference, as above. But nothing comes for free, and the cost of the convenience and general pleasantness of R is difficulty with large computations.

All this generalizes to nonscientific software as well: getting rid of the desktop manager (GNOME, KDE, or Microsoft Windows) makes a huge difference, and using a text editor instead of a GUI word processor is also night-and-day in terms of speed. But that's not the rant I'm writing today, because for user-at-desk applications you can basically do what you want with a slow program and at worst it's just a little more frustrating. But when a simulation, optimization search, or inversion of a large matrix takes several days instead of under an hour, there are techniques that one, as a practical matter, cannot use. The science you do may actually suffer because of the tools you are using.

OK, now go read this textbook on statistical computing in C and live a better life. Mr. GK of San Diego, CA did, and his analysis that took overnight to get 25% through in Matlab now runs in an hour. That person who requested a Fisher Exact test for Apophenia is somewhere not wasting hours right now. Art Garfunkel didn't switch to C, and his acting career never took off. I used to have a simulation written in R calling compiled C that took overnight to process 100 agents, but now that it's all in C, simulations with 9,000 agents run in forty minutes. Don't risk it--learn to do statistical computing in C today!

Details
For those of you who want to try this at home, here are the scripts I used. Readers who have other fave stats packages are encouraged to send in their own time comparisons with R or C, but note that some packages may approximate large Fisher tests via a chi-squared test. For the purposes here, they are cheating. And once again, please note that I used a hacked version of the R-side fisher.test function.

test_ct <- 5e6
x       <- matrix(c(30, 86, 24, 38), nrow=2)
for (i in 1:test_ct)
    {fisher.test(x)}

And the C code:

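#include <apop.h>

int main(){
    int         i, test_ct  = 5e6;                      /* number of repetitions */
    double      data[]      = { 30, 86, 24, 38 };       /* the same four cell counts as the R script */
    apop_data   *testdata   = apop_line_to_data(data,0,2,2);   /* pack them into a 2x2 matrix */
    /* Run the test over and over; the returned struct is deliberately ignored. */
    for (i = 0; i< test_ct; i++)
        apop_test_fisher_exact(testdata);
}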

The reader well-versed in Apophenia will notice that there is a memory leak, because apop_test_fisher_exact returns an apop_data struct that never gets freed. But a million lost matrices didn't affect the speed of the program at all. The lesson from this is that the details of memory management that R is handling for you are not such a big deal on a modern PC anyway.

Relevant previous entries:
Some haphazard advice I gave to a few grad students on using C

The one about object-oriented programming in C

The one about data mining, which really applies to C and R equally well



Replies: 3 comments

on Thursday, January 10th, KW said

Interesting article, but CPU time is incredibly cheap and developer time is quite expensive. The economical decision is to help the developer work fast, which is why interpreted languages are so wonderful.

on Thursday, January 10th, S Ellison said

The grave loss in speed relative to C here is largely caused by two things: i) R is doing far more preliminary data checking (in native R, not C) than the C routine; ii) there is an absurdly large overhead in repeatedly breaking out into C from a scripting language. This is far and away the least efficient way to run a looped operation; the efficient way is to do the repetition inside C. Here, the result is a slowdown by an order of magnitude or so more than it would have been if programmed efficiently.

Having said that, no one should ever expect a scripting engine - as R largely is - to run as fast as precompiled code. It carries an interpreter overhead that cannot be reduced.

As it happens, a lot of the heavily processor-intensive stuff in R _is_ inside C where it matters, but even then, the scripting wrapper has to do _some_ work.

What you get for the inevitable performance hit in a scripting language (Perl, Python, Ruby, PHP... you name it) is massive gains in programming time and simplicity.

For research and prototyping, use scripting every time. For real-time programming, that's not such a good idea...

Horses for courses and all that.

But please don't claim a scripting language is slow because you use it inefficiently!

on Friday, January 11th, the author said

To those of you who are telling me that R is giving you "massive gains in programming time and simplicity", do compare the R and C code above. I know the R code is a bit cleaner, but I'd hardly call the difference "massive". I wasn't running a stopwatch, but developer time on both the C and R scripts was about the same for me.

There seems to be a perception that if you're writing in C, you have to write everything from scratch using only bit-shifting operators. C has been around for decades, so there's a library to do about anything you could want. Sure, if you're writing new code to do LU decompositions or run regressions, you're spending massive amounts of developer time. But the GNU Scientific Library will do an LU decomposition with a single function call, and the Apophenia Library will estimate OLS regressions in another function call, for a total of the same two lines of code it would take to do those things in R. And given that the decades have brought about better interactive debuggers for C and a broader range of libraries allowing a broader range of coding styles, the claim that developing in R is easier or less error-prone than developing in a well-equipped C environment ain't so obvious.

I do agree with you, S, that R has its place, and I couldn't possibly criticize a person for preferring one language over another. I understand the fun and utility of scripting languages—I use Python and Ruby on a regular basis. But I wanted to highlight that the speed hit is not necessarily just a couple of seconds that we can brush off with a quip about cheap CPU cycles: in my experience, the slowdown is big enough that computationally intensive work like agent-based modeling or simulated annealing of an arbitrary objective is just plain impossible. [R's simulated annealing routine only works for a specific form of objective function. I'm not sufficiently familiar with the form to know how broadly it applies, but I know that everything I've used siman for in the past year won't fit.]

Naturally, nobody on the R project is going to tell you that sort of thing, and I lost a few months of development time because I thought that coding in R would be just a few percentage points speed difference, rather than a project-killer.

R broke my heart, and that's why I posted this.

PS: I think it would be interesting if some R experts submitted some code to the Language shootout. I've told you my methods for the timing comparison above, and I am fully aware that one could poke holes in it. I (and I'm sure many others) would be interested to see R tested in a more formal setting. If you do submit to the shootout, please post a link to your results in the comment box here.
