Patterns in static

A critique of R and S-PLUS

28 February 04.

A few publishers have shown interest in my book on statistical programming. One asked me why I'm using the C programming language instead of the more popular R. I told her that R broke my heart, and that I could write a thousand-word essay on R's problems; it seems I've blown clear through that word limit here. I've decided to focus on a non-technical discussion of why stats packages in general have made the practice of science worse off, and then to tack on a few S-specific notes at the end for the programmers.

Stats packages encourage data snooping and data mining
In olden days, the idea behind doing statistics was that you'd develop some theory about the world based on your observations, and then gather data with the explicit intent of testing the theory you'd come up with. The premise was that people are eminently creative and will make sense of whatever noise gets thrown at them, so hypothesis testing is built around rejecting spurious theories rather than proving anything. [The wording is `we reject' or `we fail to reject'.]

There are two problems with this ideal in the real world. The first is that data gathering is expensive, and an efficient division of labor requires us to have people who work full time on producing large, general data sets, such as everything you ever wanted to know about a corner of the Black Sea, or the Current Population Survey (CPS), which asks a number of households a few hundred questions every month. The second is that all the easy hypotheses you could develop by thinking hard and watching stuff have already been hypothesized, and we're now looking for more detailed information that can only come from looking closely at those general data sets.

This draws us very far from the ideal: before, we were looking at the world at large and then gathering data to test the hunches we developed therefrom; now we're looking at the general-purpose data sets and then using the same general-purpose data sets to test the impressions they'd just given us. The data can no longer be used to weed out the spurious hypotheses--and using next year's CPS to test hypotheses based on this year's CPS is still not going to weed out nearly as much as it should. But going out and gathering a purpose-built data set around your hypothesis is hopelessly expensive, especially since every last grantmaking body you talk to is going to ask you, `Why aren't you just using the CPS?'

This is a big problem, because there's a lot of data in the CPS, and it can easily be proven (I have an amusing cite from Freedman; update: see below) that data mining from a set that large will always be fruitful. If somebody hands you a hundred uncorrelated, random variables, you can easily find five of those variables that explain another of the random variables with very high significance, nice t-statistics, a huge R², et cetera--but by construction, it was all just noise. The CPS is not noise, of course, but when our statistical tests do as well with noise as with real data, something in the overall method is wrong.
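To make that concrete, here is a quick simulation; the sizes and the seed are my own arbitrary choices, and one column of pure noise gets "explained" by a hundred others:

set.seed(42)                                               # arbitrary seed
d <- as.data.frame(matrix(rnorm(1000 * 101), ncol = 101))  # 101 columns of pure noise
fit <- lm(V101 ~ ., data = d)                              # regress one noise column on the other 100
which(coef(summary(fit))[-1, 4] < 0.05)                    # roughly five predictors pass the 5% significance bar

Keep only the variables that pass this screen, rerun the regression, and the final fit will look downright publishable; by construction, it is all static.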

Hypothesis testing after snooping around the data can't be used to falsify spurious guesses, which puts science up a creek without a proverbial paddle. But I can offer one constructive suggestion: don't let interactive tools that explicitly push data snooping form the hypotheses for you.

Now, there are situations where people are not being ultraskeptical scientists, but are just trying to come up with some interesting facts for the boss, or a few new ideas. In this case, snooping around the data is not such a bad thing, provided you bear in mind that there's still a good chance that your observations are pure luck. But it's essential not to confound these methods with those of hypothesis testing if hypothesis testing is to have any legitimacy.

A representative line from the SPSS 8.0 Guide to Data Analysis, a textbook a friend of mine used for a general statistics class, published by SPSS Corp.: “You should always plot the data first and then think about appropriate methods for summarizing them.” [p 133] The author (Marija Norusis) is clearly more interested in summarizing data than in testing hypotheses, which is cool, but why does she then segue into the chapters on testing hypotheses without any note (that I could find) that the techniques serve different and basically contradictory goals?

As for R and S-Plus, their data presentation is carefully designed to facilitate data mining. Here is a truly interesting interview with the guys who came up with Trellis graphics, in which they explain the great pains they took to make patterns pop out of visual displays of data, and the psychological foundations upon which this stuff is based. Again, this is great for any of a number of applications, but not hypothesis testing. They dance around the issues I discuss here by saying that the graphic display “allows you to make a good guess about an initial model to fit and then to diagnose how well it fits the data,” but then chide the interviewer for suggesting that you might as well run explicit tests for goodness of fit of every variable against every other.

There are valid reasons for graphing data in a hypothesis-testing context, such as sanity-checking to make sure that your results aren't driven by outliers, and that the data's range is basically sane and doesn't indicate errors by surveyors or instruments. But these are things you do after you've decided on what your regression will be about, and don't require nearly the visualization power that stats packages provide and encourage the user to use.
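For instance, the checks I have in mind are about this elaborate (dat and wage are stand-ins for whatever data you're regressing):

summary(dat$wage)   # is the range sane, or did a surveyor key in 9999999?
plot(dat$wage)      # a quick eyeball for outliers that could drive the results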

Of course, this is true of all existing statistics packages; I single out S because of what's to follow, but the emphasis on the plot-first, ask-questions-later approach, as much in the documentation and the community as in the code base itself, gives statistics a bad name. As Professor NS of Saint Louis, MO, advises: never trust a regression run after 1975.

S is call-by-value only

On to the computer-geek part, whose primary lesson is that you shouldn't tie the user's hands unless you have a really good reason. R and S-Plus are based on the S language, developed at Bell Labs. This is the same lab that gave us B and C, so I guess the name is not particularly surprising. Anyway, the designers decided not to include a call-by-reference mechanism in the language.

By way of explanation for the people who are still reading but not familiar with this: it's all about calling a function like product(A, B). One approach (call-by-value) is to treat the function like a black box. The program throws copies of A and B into the black box named `product', which mangles these copies in any manner that seems expedient and then pops a product out the other end. The original versions of A and B don't change, since you only put copies into the box.

The other option (call-by-reference) is to put A and B themselves into the box, and then the program returns the product, as before, and also potentially modified versions of A and B. The black box metaphor sort of breaks down, and your function becomes a component that is intermingled with the rest of the program.
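In R terms, here is a minimal sketch of the call-by-value side; the names are mine:

product <- function(A, B) {
    A <- A * B   # mangle the local copy in whatever manner seems expedient
    A            # pop the product out the other end
}
A <- 2; B <- 3
product(A, B)   # 6
A               # still 2: only a copy of A went into the box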

There are loads of difficulties with call-by-reference; have a look at this little page, for example, for a long list of considerations. But to explicitly disallow it is to shoot oneself in the foot. If we had a language which allowed only call-by-reference, we'd have no problem implementing a call-by-value function: just write local_A = A as the first line of the function and don't use A for anything after that. On the other hand, if your language allows only call-by-value, there is no way to implement a call-by-reference function, which is a pain. There are ways in S that involve unpleasant convolutions of the language, overuse of the global name space, et cetera, but as workarounds go, they are especially ornery.
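Here is a sketch of one such workaround, welded to the global name space; the function name is mine:

x <- 5
inc_x <- function() x <<- x + 1   # `<<-' reaches outside the function, but only for a variable literally named x
inc_x()
x   # now 6; rename the global and the function breaks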

The implications of this are many. First and foremost, the language is slow. Even with tricks to only copy when absolutely necessary (which add lots of code and fragility in and of themselves), the language is still a few hundred times slower than equivalent stuff written in C.

7 April 2006: Not to be a total theorist about it, I did a timing test for R calls to C code versus plain C code. I would call the results impressive.
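That test pits R-calling-C against plain C; for a cruder feel of the interpreter overhead without leaving R, compare:

x <- runif(1e6)
system.time({s <- 0; for (v in x) s <- s + v})   # the sum via an interpreted loop
system.time(sum(x))                              # the same arithmetic; sum() runs in compiled C underneath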

Second, it makes good coding more difficult, since you can't just move code from your main function out of the way into a subfunction when doing so would make things more readable. In many (but not all) cases, you'll need to do a rewrite as you move.

Third, it dictates how the program is written: everything winds up being an assignment. Some people write like this anyway, and it makes for very nice looking code when everything is lined up neatly. But some things don't lend themselves to being assignments, and there's no alternative available. To give a concrete, trivial, and painful example, I sorely miss being able to just write var++ to add one to var. In R, I need var <- var + 1, which is OK for such an easy case, but var[2, other_var*7-2 == var[3,]] <- var[2, other_var*7-2 == var[3,]] + 1 requires an editor with a good cut-and-paste function and gives me a little more redundant code to keep synced up. In other languages, I'd write an increment(x) function which would unroll to x <- x + 1, but in S, such a simple desire is impossible.

Update: You'll see in the comments below that since I wrote this, R gained a limited call-by-reference capability.

S has no forced typing
One of S's most delightful features relative to the other stats packages is that it allows for arbitrarily complex data types, notably data frames, which gather lists of different types of data. Also, everything in S is implicitly typed, based upon what S thinks you mean. These two features are painfully contradictory. My main problem with this is with building a list of lists: say you assign a number to the first element of your list, and then try to make the second element a list itself. Unfortunately, that first assignment told S that L is an array of integers, not an array of lists of integers, so the second assignment will produce a complaint.
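Here is the trap as a transcript; the numbers are arbitrary:

L <- c()
L[1] <- 3         # S silently decides L is an atomic vector
L[2] <- c(3, 4)   # warning: number of items to replace is not a multiple of replacement length
L                 # 3 3: the second element kept only the first value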

This is a simple case, but things get much worse when dealing with real data structures. And yes, there are ways around it, and yes, some of them even work. But in the end, they're all apologies for the fact that there's no way to just declare L to be a list of lists and not let S coerce it to a list of integers. Since I came to S because I was so enamored of its complex data structures, the fact that so much coercion goes on behind your back--and often can't be prevented--makes the lack of declared types a deal-breaker for me.

I could go on with a few other minor details that don't make for such a coherent essay, but I think these are already sufficient to tell us that S is a bad idea for the sort of work I talk about: hypothesis testing and simulation. If you have a data set which you just want to interrogate, you're not doing any heavy-duty programming, and you're not publishing in an academic journal, then by all means, R will work for you, and even has some nice features over other stats packages. But beyond that, you're better off with any of a number of options.

2 Feb 2006: Somebody did actually request the Freedman article, so since I finally looked it up, here's the citation:
David A. Freedman, "A Note on Screening Regression Equations," The American Statistician, Vol. 37, No. 2 (May 1983), pp. 152-155. [Jstor]



Replies: 4 comments

on Thursday, January 10th, Felix Andrews said

There is some discussion of this post on the R mailing list: http://www.nabble.com/Re%3A-An-%22R-is-slow%22-article-p14715310.html
(There are doubts about your C vs R timing comparison.)

Also:

> "S is call-by-value only"
Not true. You can use environment objects (which are otherwise just like lists) to pass by reference.
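A minimal sketch of the idea:

e <- new.env()
e$x <- 1
bump <- function(env) env$x <- env$x + 1   # the environment itself goes in, not a copy
bump(e)
e$x   # 2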

> "I'd write an increment(x) function which would unroll to x <- x+1, but in S, such a simple desire is impossible."
Not true: see `inc` in the `Hmisc` package: inc(x) <- 1.

> "there's no way to just declare L to be a list of lists and not let S coerce it to a list of integers."
How about this:
L <- list()
L[[1]] <- 3
L[[2]] <- 4:3

on Thursday, January 10th, the author said

Thanks for the comments and the link (which mostly refers to this post, qv).

To summarize for those who don't click through, the discussion points out that the fisher.test function does some additional computation beyond the calculation of the basic statistics, and that may slow down the R code. Few if any readers looked at the part where I tried the test using multiple magnitudes and found that the data was very much not consistent with the claim that it was R-side bounds-checking that was slowing things down.

But really: this isn't an isolated test case; it's born of a great deal of heartbreak with R. I really did implement that agent-based model in R-with-C, with so much C code that there was nothing but about a ten-line for loop left on the R side---and that was still incredibly slow. Putting the main loop in C suddenly made the whole darn thing faster. I even made some effort to decipher the code controlling the R-to-C interface, but didn't have the time or ability to work out what it was doing.

The dotcall function (the back end to .C()) does a lot of work to convert from R's one-size-fits-all SEXP to C types, and makes heavy use of R_alloc, which, as you may know, maintains all data used by R in a single well-managed array. I don't have time to read through the management code, but carefully-managed memory is obviously going to be slower than memory that just calls a vanilla malloc() [and it's not very carefully managed, I might add, given how easily I can get R to thrash the hard drive]. The Fisher test code uses R_alloc throughout, and I had to go through and change all those calls to malloc calls when extricating the code from R. So the function wasn't just using R's memory management at the R-to-C border, but throughout the evaluations.

The thread also includes lots and lots of people confessing that they have no idea how to write in C. Of course, they don't come out and say it---instead they go on at length about how writing code in C takes eons longer than writing comparable code in R. From what I've seen, the most common reason people don't get C-style coding is that they don't realize how many libraries exist to facilitate everything. If you want safe memory allocation or matrix operations or linked lists or whatever, you can just call a well-tested and well-documented library just waiting to be linked in. Conversely, R's code base (131,000 lines of C code, by the code-counter I wrote in Ruby) is basically written from scratch, without calling pre-existing matrix, memory, or statistics libraries. R's source code even reimplements qsort, which is already in the C standard library. [Though I'm sure the graphics and GUI are based on a library somewhere...]

Why does R have speed issues? Because so many people who use R are convinced that speed is a non-issue and that computing power will save us; therefore many of R's authors prefer to add more order-N checks and casts instead of carefully considering how to make the code base faster. Some of the commenters on the thread you link to were really concerned that maybe C calls and R_alloc should be re-examined, but they seemed to get shouted down by others who just made lots of ad hominem attacks against people who don't write in R. There are a whole lot of people out there choosing SAS, SPSS, and other speed-tuned systems with gigantic price tags and Byzantine syntax over free and pleasant R, and they're not all morons. And as I learned the hard way, the major agent-based modeling packages aren't written for R. I wish the people who want to improve the calling mechanism and memory allocation the best, because I want R to be fast enough for me to use; as for the people who insist that speed needs to be a low priority, I refer them to a comparable discussion by Joel the guru facing the same bumper-sticker comments about developer time and CPU cycles.

As for your detailed comments, thanks for educating me on those points---you clearly know R syntax better than I did even after using it constantly for a few years.

* Using parent.env() in a function call to implement call-by-reference doesn't count, because you need to set the variable names in the parent environment to match the name used internally. If the internal name for a variable is x, then it can't be named y in the calling environment or else the function has no way to find it. That breaks encapsulation. Though, R seems pretty content with making the names of a function's internally-used variables public, so maybe it's just a question of taste. Or maybe you're talking about a different type of syntax that I'm missing.

* I will confess ignorance regarding R's replacement-function syntax; thanks for demonstrating it. The source code for inc is simple enough:
"inc<-" <- function(x, value) (
x + value
)
It uses the fact that R takes a function so named to be a replacement function, which rewrites inc(x) <- 3 into x <- `inc<-`(x, 3) and thereby modifies x in place. So ya got me: you _can_ write a C-style increment function in R.

But this is an exceptional syntactic form. Say that we want to increment two variables at once, like inc(x, y) <- 3. To clarify, in C this would probably have a header like inc(double *x, double *y). You're stuck: there's no exceptional form that I could find in the documentation that would modify y. And wanting a function to modify two arbitrarily-named variables at once is not a particularly crazy wish.
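To illustrate with a two-argument version (the names are mine), R's rewriting rule only ever writes back to the first argument:

"inc<-" <- function(x, y, value) x + y + value   # y comes in, but nothing is written back to it
a <- 1; b <- 1
inc(a, b) <- 3   # R rewrites this as a <- `inc<-`(a, b, value=3)
c(a, b)          # 5 1: only a changed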

My purpose in bringing up the whole issue of call-by-value is that it slows down R's math. First, you have to admit that the increment form is kinda obscure---so obscure that none of R's math routines use it, which I know because I've got grep and the R-2.2.0 source code on my drive. That means that the routines are slower than they could be. And you can't package a function using parent.env(), due to the encapsulation issue above (and grep tells me none of R's math packages do so). Thus, when you do massiveset <- library_fn(massiveset), there's no way around wholly unnecessary but potentially costly copying.

* As for the list-of-lists issue: gee, I guess your syntax works. I'd been doing more naive things like
L <- c(c(4, 3), c(3, 4))
which does not work as intended, since c() flattens everything into a single atomic vector of length four. I recall spending way too much time on this simple task and getting nowhere.

on Friday, January 11th, kenn said

You can do an increment function using parent.frame in several ways.

For example:

INC <- function (Y) assign(deparse(substitute(Y)), Y+1, envir=parent.frame())

or in a rude way:

INC <- function (Y) {
Y <- deparse(substitute(Y))
eval( parse(text=paste(Y, "<-", Y, "+1")), envir=parent.frame())
}

Now try

X<-1
INC(X)
print(X)

You get 2 in spite of the fact that you call it X in the calling environment and Y inside the function. Very simple! Although, for reasons beyond my understanding, only the second version works with subsets of objects like a row of a matrix.

on Thursday, October 30th, Justin BEM said

Since I'm a French speaker there will be mistakes in my English, sorry!

About the critiques, I will say that there is no perfect tool that satisfies everyone. As for the use of C for statistical computing, many drawbacks could be pointed out. I appreciate your courage in making these criticisms, but your comments are not useful. Lexical scope in statistical languages is discussed in:

Lexical Scope and Statistical Computing
Robert Gentleman, Ross Ihaka
Journal of Computational and Graphical Statistics, Vol. 9, No. 3, Systems and Languages (Sep., 2000), pp. 491-508
Published by: American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of America

If you want to reinvent hot water because S broke your heart, go ahead.
