| Complementing your stats package with SQL |
|
navigational aids: News ticker:
|
16 March 05.
The basic principle behind Apophenia is that data should be kept in a database until it's needed, and then just enough pulled out for analysis. There are some things that are much easier to do in SQL, and there are some things that are much easier in a matrix-oriented programming language, and knowing what to do in which context can save hours. ![]() Figure One: Just meant for each other. Things which are easier in a database¤ The mega-dataset, which asked every respondent eight thousand questions, of which you're going to use six. You need to read the whole file to get the data you need, but your computer doesn't have enough memory to hold all 800MB of data. So INSERT every line into the database, which will get written to disk, then select out the little bit you need. Your little laptop won't even break a sweat. By the way, don't forget: if you run regressions on all 8,000 variables, you're a bad person. ¤ Anything involving more than two dimensions. This is really what databases are designed for. You've got one data set which relates cholesterol levels to smoking, and another which relates smoking rates to frequency of getting laid, and you want to show the correlation between high cholesterol and multiple sexual partners. Such merging of data sets is a basic operation in database land, and a total pain in matrix terms. [Select t1.*, t2.* from cholesterol_data t1, sex_data t2 where t1.smoking_rate == t2.smoking_rate. And yes, I know this is wholly spurious statistics; it's an attempt at humor.] ¤ Aggregation. Say you have a few observations of income (between one and ten, maybe) for every ZIP code, and you want the average per ZIP code. Again, a total pain via for loops through a matrix but one line in SQL: select zip, average(income) from data_table group by zip. SQL is limited by not having any real aggregation functions outside of average(), sum(), and and count(), but that's all you'll need 90% of the time anyway. [You can still do weighted sums by things like sum(income * weight).] ¤ Subsetting. Some languages (matlab, octave, R) do a good job of this, but others (Stata, I'm told) have no really easy way to pull subsets of the data a la select * from table where X*Y>.5. As you can see from the example, SQL is built to make this stuff trivial. Things which are easier in a matrix-oriented program¤ The actual math, the regressions and MLEs and such, are not gonna happen in the database, so after you've done all the data-shunting in the database, you'll have to pull it to a matrix for the final analysis. As above, the most math you can do in a database is basic arithmetic, and I haven't yet seen a db program which can be extended with user-defined functions. E.g., no reasonable query will attach a Gaussian-distributed random draw to each observation. ¤ Anything in which the data must be ordered, such as producing a CDF from a PDF. [But time series guys, you can lag in SQL: select (t1.income - t2.income) as diff from data_set t1, data_set t2 where t1.date == (t2.date - 1)] ¤ Real live matrices (as opposed to data sets). SQL is an algebra on tables, but its concept of the product is pretty drastically different from the matrix algebra product. Taking the transpose of a database table just makes no sense. Summary: database methods are not a panacea for anything, but are an excellent complement to matrix-oriented programming languages, because things which are difficult in one are often easy in the other. If you know what is easier in which, you can save a whole lot of your life not having to write little procedures. Policy implications: read up on SQL (many a tutorial out there). If your favorite stats package doesn't handle database operations, you've got two choices: dump it and get to know something like Apophenia, or get a standalone database program and use both. Write the data-massaging half of your scripts in the database program, then write the analysis in the matrix-handling program. If the two programs are worth anything, they should have command-line and text file reading/writing utilities to facilitate this.
[link] [2 comments] Replies: 2 comments
|