| Some fluff, some info |
02 December 03. RTFM [Read the manual]

In which the author bitches, and then gives practical advice that will save (segments of) the reader's life.

OK, so here's some quick math that I worked out on a little spreadsheet. Say there's some little routine that takes you five minutes a day, and you could do some tedious work and eliminate it through an hour of research into correctly automating the thing or something. Then over the course of a work-year, you'd save two and a half days of time. In the words of the great Margaret Cho, you could take a pottery class. So some research up front can pay for itself many times over; this is nothing your momma didn't already tell you.

But when people sit in front of a computer---the paragon of automatability---people instead want something they can use immediately. A good program, we read, does exactly what the user thinks it should from the start (i.e., is intuitive). It should require no learning of new methods, and should instead analogize to real-world actions like pointing to things (i.e., clicking on pictures).

So if I've set up this essay correctly, it should be completely obvious that the first and second paragraphs there are in direct contradiction. From the perspective of efficiency, intuitive design is not helpful. On occasion we're lucky and the most intuitive and most effective method match, but in most cases we need to learn some new little method to implement the most effective route.

I am flabbergasted, horrified, and in despair over the resistance people have to learning how to use software. People will read the entire frigging Chicago Manual of Style rather than learn to use new software (see below).
They're using this one program because in 1997, when they were afraid of computers, they could pick up that program and use it in the first hour---but it's been six years and they've used that program for a thousand hours a year, and they're no more efficient at clicking the little icons, and they spend more and more time trying to do increasingly complex tasks until they realize that there's no intuitive way to efficiently do the complex things they have to do. Oh, the mortal cost of that first hour! What a seductive compact with the devil were those icons! The little sound effects were so pleasing, yet the software proved to be a siren which drains away all life, one pleasing mouse-click at a time.

We should face facts: we're in front of computers, all day, every day. That doesn't mean we like them, but if we like what they do for us, then it's really worth an initial effort to work out what's good software---not in terms of what matches your immediate intuition, but what you can learn to use efficiently as you use it every frigging day until you finally reach the ever-growing retirement age and can finally just stay home and surf the Web.

Why I care

One: I have to give tech support to people all day long. The fewer people using cute but annoying software, the less tense I'll be. Two: I want better software that I can use. All that effort that goes into making better eye candy for Word is wasted for me. Three: ``No [person] is an island,'' wrote John Donne, and it truly, physically pains me to hear the travails of somebody who wrote their dissertation in Word, which you'll recall the New Yorker describing as a terrible program.

Things you can do

Here are some things that are not intuitive, require reading the manual, and will save you huge tracts of time. Listed in order of commitment.
You don't have to read the whole manual for any of this; you don't have to memorize anything (just leave the manual open for reference somewhere and you'll remember the important stuff soon enough); you just have to learn enough to understand the basic framework and to know where to look for more info. Geeks will notice a thematic relation between this and the essay of 10 November, q.v.

Stop using the mouse. Most of what you can do with the mouse you can do with the keyboard, such as switching applications. [And you have learned to touch-type, no? Me, I learned by putting a t-shirt over my hands so I couldn't subconsciously peek.]

Use style sheets. Within your word processor (e.g., Open Office), there's a feature that gives you a list of header types. Instead of specifying `boldface, larger, new font' for every header, you can instead select `header 3' and the rest is set up for you. This requires initial setup and some cognitive effort, but once you've set up your headers the way you like them, you never have to do it again. When your advisor/boss tells you to change all your italicized headers to underlines, you only have to change it in the style sheet, instead of hunting through a hundred pages looking for all the italicized headers. [In Open Office, this is `the stylist'; no idea what it's called in Word. RTFM.]

Never, ever, write a bibliography by hand. Intuitive, direct method: read the Chicago Manual of Style and learn the rules for italics, punctuation, and ordering. Efficient and effective method: use a bibliography database. LaTeX has bibtex; Open Office has a built-in bibliography manager; for about a hundred dollars, you can purchase EndNote for Word. [The lack of a bib DB is reason enough to dump Word. Thinking about the pain my mother suffered writing her bibliographies in Word makes me well up.]
You have to set up your bibliographic info in a database, then insert formatting codes into your document, and then you're guaranteed that every last comma and period will be in the right place, and that when you refer in text to Schweitzer[15], that Schweitzer is not number sixteen or fourteen in the bib.

Stop using your spreadsheet as a database. Spreadsheets make it easy to instantly create a list of things. The same guy I linked above (Joel on software) worked on the MSFT Excel team. There they were at MSFT, designing amortization wizards (because amortization should be intuitively obvious), and one day they watched some users actually use their stuff, and found that most users just used their software for writing lists; so they added lots of list-making and list-handling features. However, there's actually an entire class of programs built from the abstract algebra up for making and organizing lists: databases. As with everything else I'm talking about, databases are conceptually unintuitive, require setting up before you can start writing lists, and will save you hours and hours of your life in the long run. You may have a copy of MSFT Access on your hard drive right now. [Data geeks: instead of using SAS or whatever, check out SQLite; between that and the GNU Scientific Library, I have the data analysis package I'd always dreamed of. Since 10 November, I really did switch all my models and statistical analyses to C, and the effort has already paid dividends.]

As a matter of fact, dump Word entirely. Even ditch Open Office, which is nicer but still in the same ballpark. LaTeX is the best document preparation system in existence, by far. Many authors typeset their books in it. Same as above: initial learning curve, new paradigm, won't get anything done the first few hours, but will save you weeks of pain in the long run. [There are graphic shells for LaTeX, like Scientific Word. They unabashedly suck.
I have pals who write their complex equations in SciWord and then cut and paste the resultant LaTeX codes into their main text-edited document, which seems like the optimal use of these things. People who LaTeX from Windows are fans of WinEdt, which has some nice LaTeX-specific features.]

No, better still, dump Windows entirely. Do I need to say it by now? Windows is aimed at being intuitive from the get-go, while the Unix-type systems are aimed at automating and simplifying your work and then getting the f*ck out of the way. If you have a life sentence with one paradigm or the other, forced to spend a thousand hours a year chained to a keyboard, which would you choose? Windows people: try Cygwin, which installs a Linux box within Windows, so you can use LaTeX there, and produce beautiful PDFs that you can pretend you wrote in Word. [That's how I'm writing this now, so if a Windows fascist comes by, I can minimize Cygwin and pretend I'm using Windows. In my experience on this machine, Windows & Cygwin coexist nicely.] MacPeople: if you're running OS X, you've already got a Unix box; get to know it and the non-Mac software which will help you do the above. The commitment here, by the way, is not in installing the OS, since Unix will coexist with Windows or OS X, but in learning the myriad or two of programs which Unix facilitates.

If you're still avoiding work, have a look at my essay on Unix and its users, linked at right.
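The quick math that opens this essay is easy to redo for your own pet chore. Here is a minimal sketch of it; the 250-day work-year and the eight-hour working day are my assumptions, since the original spreadsheet isn't shown, and the function name is just for illustration.

```python
# Back-of-the-envelope payoff from automating a small daily chore.
# Assumptions (not stated in the essay): 250 working days per year,
# and a "day of time" meaning an 8-hour working day.

MINUTES_PER_DAY = 5        # the little routine from the essay
WORKDAYS_PER_YEAR = 250    # assumed
HOURS_PER_WORKDAY = 8      # assumed
SETUP_HOURS = 1            # the hour of research into automating it


def yearly_savings_hours(minutes_per_day, days_per_year):
    """Hours per year no longer spent on the chore."""
    return minutes_per_day * days_per_year / 60


saved_hours = yearly_savings_hours(MINUTES_PER_DAY, WORKDAYS_PER_YEAR)
saved_workdays = saved_hours / HOURS_PER_WORKDAY
payback_days = SETUP_HOURS * 60 / MINUTES_PER_DAY

print(f"hours saved per year: {saved_hours:.1f}")
print(f"working days saved per year: {saved_workdays:.1f}")
print(f"working days until the setup hour is repaid: {payback_days:.0f}")
```

Under those assumptions the savings come to a bit over two and a half working days a year, and the hour of research has paid for itself within about two and a half weeks of workdays.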
06 February 04. A lament about bad design

OK, this was going to be a lament about any of a number of things, with a general discussion of how to design things to be useful but not annoying. However, I fear that it's going to turn into a rant about MSFT. Sorry, guys.

It's about designing things for dumb people. This is, by itself, not a bad thing. After all, designing for less cognitive effort benefits us all, just as features designed for the handicapped are often embraced by able people who just want it easier. [I've been using the twiddler lately; it would be most useful for people who only have one functioning hand, but I like it because it lets me drink tea and type at the same time.] Or, at the other extreme, here is an article about how an unintuitive interface killed John Denver.

The two prime examples of this would be advertising and MSFT products. I think the whole thing about idiot-proofing Windows has been discussed to death, and needs no elaboration here. I've already talked about how advertising has gone from a long textual evocation of the product (with bold headlines for those who are just skimming) to a picture of the product being held against a pair of breasts.

Don't get me wrong, intuitive interfaces and things that dumb or inattentive people can readily digest are not necessarily bad. But what makes them horrible is when the design makes it impossible or too difficult for not-dumb people to go beyond the dumb level. For example, when waiting for a subway train, I am confronted with a number of large, backlit ads right in front of me, and I typically have about ten minutes to kill. This is the perfect opportunity for the vendor to tell me all about the product, in great, backlit detail. And yet all settle for a picture. and a tag line. with inappropriately placed periods. where a comma or hyphen will do.
This is true for both SBUX, which will just show you a picture of a cup (which we presume contains coffee), and Boeing/McDonnell-Douglas, which outbid SBUX for ad placement at the Pentagon Metro station, and advertises bombers and helicopters. Surely there's more information that we need to know about the latest bomber than about a cup of coffee? But the advertising won't tell me. If I care, there's nothing for me to do for ten minutes but to stare at the picture some more. Oh, I could look elsewhere, but I'm not elsewhere. I'm on a subway platform, waiting. I want to see the information that got thrown away in a desperate attempt to get the point across with a minimum of cognitive effort, and am frustrated that I can't.

The other prime example of this is of course anything written for MSFT Windows. It's easy to sit down and use, which I would be an arse to be annoyed by, but it's supremely difficult to go beyond the easy stuff. Spent an hour yesterday trying to get the cute little browser thing to hide files beginning with a dot. I even wrote Dell tech support, who blew me off. As you can plainly see, if I want something that seems possible, but isn't, I will be frustrated and unable to continue to function as a normal human being.

My favorite foil whom I've linked to before, Joel, goes on and on about how frustration comes from having things that don't work the way you expect them to, adding little bits of cognitive effort and annoyance to your day. Joel probably describes many people, but I am most frustrated by tools that just plain don't work. Screwdrivers are truly counterintuitive, if you ask me (to make the screw go out, turn counterclockwise?), but I learned the righty-tighty/lefty-loosey thing. When even that doesn't help (like when the screw is upside-down, or with the few reversed-thread nuts on a bike) I am indeed frustrated. But I am infinitely more frustrated when the screwdriver is made from cheap metal and bends when the screw is too hard to undo.
Implicit to all of these things is a promise: I will tell you about my product; I will help you make your document look just right; I will unscrew your screws. Sometimes getting that promise to work takes some compromise from both sides, which is how life is. It's not the compromises but the broken promises that really hurt. Somewhere, I read about how temperature gauges are less common on cars now, since somebody worked out that most people interact with the gauge by just looking to see if it's in the red, and panicking if it is. So why bother with a gauge? Instead, you just get a little light that tells you when the temperature is in what would have been the gauge's red part. I told Mr. DRC of Santa Monica, CA---a car expert if ever there was one---about this, and he had a hissy fit, listing three dozen things you can learn from a temperature gauge beyond whether it's in the red. So in following Joel's advice about minimizing cognitive effort for 95% of drivers, the other 5% are frustrated and dejected. It doesn't have to be that way. Design that includes the lazy doesn't have to exclude those who care, and if it does, it's as bad a design as one that only makes sense if you study it for an hour. I tried to come up with more examples of where things have been redesigned for the lowest common denominator and thus shut out those who care, but couldn't think of anything really good and pervasive. Television has always been written for dumb people, and since there's a time constraint, you have to pick your level of information and stick with it---unlike a print ad, it's physically impossible to say more. There are thousands of books with `for Dummies' in the title, but there have always been such how-to books, and for every such book, there's another that goes into all the detail you could want. 
This is even true of management books, which are typically the most supremely oversimplified books in existence, since businessmen often have a pompously overinflated idea of what their time is worth. Perhaps you, dear reader, can leave some suggestions in the box below.

Meanwhile, I have nothing but a lament about the two realms where withholding from the consumer is vehemently defended as a good thing: working with PCs, and advertising. One particular item stands out as the intersection of the two: MSFT PowerPoint, a computer program for creating advertising presentations. Its design makes summarization and minimization of cognitive effort easy and information dissemination difficult. E.g., as a counter to the too-difficult design interface which caused a disaster above, PowerPoint's design is partly responsible for the destruction of a Space Shuttle.
|
12 April 04. Linkfest I So I've added an RSS feed, at left. With an RSS reader, such as amphetadesk or RSSowl, you'll get notified whenever I update this little web site. The idea is that you just check in to your RSS reader instead of clicking through to the dozen web sites you dutifully check every morning. Like all the other crap I endorse, it's not revolutionary: you spend half an hour setting it up and reading the manual, and then it saves you four minutes of clicking per day = a full day of clicking per year. I have to admit I've only had my own RSS reader for a day, and only have vicarious raves from people who say that having one of these little news tickers open has entirely changed the way they get information and has made them finally feel that enlightenment is attainable in this lifetime. I wonder what I'll do to kill time if I can't waste it checking to see if anything new has turned up on Plastic in the last six minutes. Asst links But there's always more (not-regularly-updated) junk to be had. In addition to the usual list of links, I offer the following, for my fellow distraction-seeker. I'm at the World Bank today, copying data sets to my own portable hard drive. Instead of a bar crossing the screen, the little application opts for a ball that grows in size. Very cute. Almost beats the status bar from Halo. On the desk here is a copy of Bank Swirled, the in-house humor magazine. It's filled with in-jokes and standard office humor. Representative sample: ``hello. I'm a constipated water buffallo. Is there a World Bank program to help me?'' Pocket calculator show Equally retro but more hands on, here is a set of Infocom text games for you to download and play. These games are a paragon of good computing and bad humor, and can be run on pretty much any modern hardware (including a lot of phones). This photographer took some wonderful photos of Thailand. 
Interspersed with the photos is an extensive discussion of how he went about backing up the digital photos onto both a CD and portable hard drive. The contrast is stunning. Oh, but it gets geekier. Here is a list of numbers. I've been very interested in alternative keyboards. I mean, you can have the most efficient software on the planet, but if you have to wave your hands around in painful ways to use it, then it's still not efficient. When I have any say in the matter, I use a split keyboard with built-in touchpad, in my lap. But I sometimes fantasize about what life would be like with these more innovative designs. OK, time to take a stretch break. And, of course, if all else fails, there's always cartoons, TV, or naked people.
|
14 July 04. RSS, again.

So back in April, I'd written about the joy and delight of the RSS feed. The summary: whereas the web had once been an endless pit of time-consumingness, RSS made it manageable and easy. Whereas I'd spend all day in glassy-eyed clicking before, I could now spend a little under an hour reading everything I could possibly read, and could then move on to do things involving the real world.

Oh, how times have changed. I now have so many RSS feeds that it is a truly Sisyphean task to read them all. Whereas before I could have a set, fixed endpoint (`stop when I've read all the feed updates'), this is now impossible. Meerkat, O'Reilly's wire service, will feed me a thousand links a day. Seriously reading two percent of that is already an hour. And I haven't even gotten to the newspaper yet: the New York Times will feed me a hundred articles a day, of which I'll want to read a whole lot more than two percent.

In other news, I've entirely stopped reading anything that doesn't have an RSS feed. I paid some guy to write up an RSS feed for Toothpaste for Dinner because it's funny in a severely embittered kind of way, but with no RSS feed pushing it upon me, I never looked at it. So I'm more lazy, but not actually saving more time.

The other thing that amazes me about all of these RSS feeds is the immense repetition. First, there's direct copying: Meerkat aggregates other RSS feeds, without editing, and puts them out for you in one feed instead of several. [And so, given that feed readers are now a dime a download, I'm not sure what its point is anymore.] And of course, other blogs frequently have entries among the original content in the way of `so over at this blog, they say...' without really adding much of anything. But beyond that is the original generation of the same idea that ten other people also originally came up with.
As some of you may know, I'm writing a book on software patents, so many of my feeds are about intellectual property, and frankly, the news is almost exactly the same on all ten of `em. Even the stories themselves tend to repeat; I've simply stopped reading DMCA cases, they're all so alike (as is the outrage they inspire again). This form of repetition feels even sillier than the blatant copying above---at least there was no effort expended in copying. Here, there are extensive write-ups which boil down to the same facts and the same emotions; if these guys all teamed up, they'd have one feed with the same content and a tenth of the effort. In my own head, this turns into a constant pressure to not repeat myself or others. The `others' part is frankly kind of easy: since all ten of our IP authors look at IP in the same way, I really only have to distinguish myself from one mindset. The `self' part is getting harder and harder. This is my 75th blog entry, and I simply don't have 75 actual real live ideas. This blog has been up for almost a year, but others have been up for the better part of a decade---how do they do it? It constantly worries me, and is the reason I've been posting less lately: when am I going to hit that age when everything I say is just repetition, and have I already hit it?
|
24 October 04. How to live with multiple computers

[A dump of a large number of small pieces of advice from years of shunting data. File under `boring but useful'.]

Problem statement: You work on multiple computers: home and away, or a desktop PC and a laptop, and you want to use both to work on the same projects, so you need some way to reliably shunt data back and forth. Me, I have three home laptops in various states of brokenness, a PC at (name of think tank) and a PC at (name of university), and a few accounts in the ether somewhere which I log in to from time to time, and all that has to keep reasonably synced up.

Buried in all of this are a few regimes for backing up data. Not to be mean or anything, but to those of you who aren't backing up your data regularly: what are you, fuck*ng stupid? If your work is worth doing once, it's worth making sure you don't have to do it again. [Personal to Ms. JATMM of Mount Vernon, VA: my continuing condolences. I don't mean you.]

Step one: have a home dir: Unix geeks, you already have a home directory. But you Windows users probably have data spewed all over your hard drive. Many programs like to make their own little directories for storing data, so your MP3s are in c:\Program Files\Brand Name\Program Name\Data, while your great unfinished novel is in c:\Another Brand Name\Word Processor\user Xp458yjz\documents\. Get them all in one place, which I suggest you name c:\home. [If you have multiple partitions, use the one that the operating system isn't on, which will probably be d:\home. If you don't know what a disk partition is, don't worry about it.] MacPeople, you also have a home directory, but OS X tries to hide it from you. You have a Unix command line (the Terminal, in the Utilities folder); get to know it. Since I don't have an iAnything at hand, maybe somebody else can offer more suggestions in the comment boxes below.

Put all your data in the home directory.
Make yourself c:\home\novel, c:\home\music, et cetera, and stick with the plan, even when your programs try to distract you. Linking (aka shortcuts) can help with this: e.g., set a shortcut from the Desktop directory (wherever it is; Microsoft keeps moving it) to c:\home\desktop. The idea here is that if your computer falls to pieces, all you have to do is reinstall the software and then copy back your home directory, and it will be as good as new. You probably have reinstall disks for the software somewhere; you're basically making yourself a reinstall disk for your work. Further, the costs of hard drives will hopefully be in sync with the amount of crap that you increasingly collect over the course of your life---and so you'll never have to throw anything out, ever. Instead of putting your projects on floppies in the closet, you can keep everything you've ever done in your home directory, perhaps archiving from time to time into an archive subdirectory. Me, my /home/b/arch directory includes everything I've done on a computer since 1999. This comes in handy more often than you'd expect. Also, dear Windows people, you may want to get a copy of Cygwin, which will let you execute the Unix commands below. You want this to be as easy as possible, and writing a batch file to run the command will, in the long run, be a lot easier than clicking lots of little boxes in WinZip every time, despite the initial setup. Also, children and animals will like you more. option one: Directory replication We have two options here: the first is to find a place online, and the second is to carry around a physical object. suboption one: Online: this web site is hosted by Spiderhosts.com. I pay fifteen dollars a year and get thirty MB of storage. [Tell `em I sent you and I'll get a discount on my renewal.] That's enough for a lot of the important stuff. There's also gmail.com, which will give you a frigging GB, and if you need more, I guess you can just get two accounts. 
Depending on the status of your connections, the internets may or may not always be available; this obviously won't work if you're not ubiquitously networked. [Personal to Mr. PH of Seattle, WA: now that Presidential decree has pluralized internets, I think it should be lower case.]

subsuboption one: with compression

One line:

subsuboption two: without compression

Use rsync, which will only transfer the changed part of files which have changed, making for a very quick transfer:

The --delete option gives me a rush of fear and adrenaline every time I use it. But even without that, this is dangerous because if you reverse the from and to directories, then you'll overwrite your new stuff with old stuff, thus losing all the work you just did. Nothing sucks more. With compressed files, you can of course have the same problem.

More advice: think of one and only one drive (your home hard drive, work hard drive, your portable drive) as your primary home directory. In others, set up an xfer_me directory of things that need filing back to the primary drive. The assumption is that if any of the non-primary computers go up in flames, you won't care. Day to day, you just need to transfer the xfer_me directory back to the main from time to time, and then overwrite the non-primary home directories with fresh copies of the primary now and then.

Another useful tidbit: if your computer is a laptop, and you have a home network, then both of your computers are sharing a subnetwork. Your desktop probably has an address like 192.168.2.101 and your laptop something like 192.168.2.103. If one of these can run sshd, then you can directly rsync directories, and it'll be rather zippy.

suboption two: physical device: This would be something you carry around with you. Pretty much anything will work: those little USB keychains, your MP3 player, even your digital camera can probably store files for you. Portable storage is frigging everywhere.
The obvious problem is that if you forget it at home, you're screwed, but it will of course work if you have no Net connection. Also, don't forget the cable, which I've done embarrassingly often. [Spare tip for buying a spare cable: A bit of poking around online should find you the formal name of your cable---more and more are A-mini B---and you'll find that you pay a lot less for an `A-whatever cable' than a `cable for brand name device'.] Devices that your operating system recognizes as just another drive are the best; some devices (Kodak cameras, Apple iPods, Creative MP3 players) require additional software, which will require installation everywhere you want to use it---annoying. On such devices you may have to archive your home directory into one file (see subsuboption one, above) and then copy that single file over. The world of devices falls into two categories: flash memory based and hard drive based. With a hard drive based device, you may as well make _that_ your home directory, since you should have enough space to do so. The only problem is that there's a slowdown with external hard drives; unless you've got really good equipment, it's noticeable. [Me, I have an Archos Jukebox, which mounts normally and has the imprescindible feature of rubber bumpers. [That's Spanish for `can't-pass-up-able'.] I partitioned it into a vfat partition for the music (`cause the firmware demands it), and a reiserfs partition so I can have the pleasant features of real-live links, permissions, and a journal.] Flash-based devices tend to give you much less storage---you'll have to leave the MP3s at home, and just back up the important stuff. That's OK with the above methods, since both give you options to exclude directories (which is why you want to do this from a batch file, where you can specify what to exclude once and for all). Not a problem, but don't forget to back up the data you're not carrying with you some other way. 
Since you're not networked, you don't need the `-e ssh' clause:

The biggest practical problem with both of these methods is that you have to remember to transfer every single time. If you work on the project at home, rush to work without transferring, and then work at the office, then reconciling the two versions will be a nightmare. You'll almost certainly lose data, which may make you cry.

option two: versioning systems

The idea of a version system is that you have a repository somewhere which holds the project. When you want to work on it, you check it out, and then check it back in when you're done. Once checked in, your copy is entirely disposable.

Using a versioning system changes your mindset. You can screw with your copy of the project all you want, since it's just a copy; you don't have to save lots of revisions, since the system is doing that for you; you'll find your work will be more structured around work sessions which have a specific goal. It's fun. Finally, it solves the problem of forgetting to bring to work changes you made at home, since it will do its best to merge two modified versions without losing anything. This sometimes requires human assistance (it'll tell you when), but you never lose work.

The standard revision control system is CVS, which (if you're using Cygwin or anything unixy) is on your hard drive now. CVS has been replaced by subversion, but subversion isn't yet common, in the sense of any given computer basically being guaranteed to have it. If you're reading this more than a year from now, try that first. Once you set it up (RTFM), the only commands you'll ever need are

Oh, and don't forget to back up your CVS directory from time to time. Mine crashed once. I managed to not cry but I was grinding my teeth in my sleep a lot after that.

You'll also have to decide where to put the repository. If you're a student at a university or a not-Microsoft-dominated office, then you probably have a shell account where you can put it.
It's surprisingly small, so a spiderhosts account will do just fine for the rest of you. Unless you're doing something screwy, your CVS repository will be under a few dozen MB, so a keychain drive will also work. [Tip on keychain USB drives: avoid puffy ones. I saw a girl almost weep when the only copy of her presentation was on a USB drive which didn't work because the plastic casing got in the way on the laptop she was trying to plug in to. She was also silly for not having backup: when traveling to do the Big Presentation, leave a copy online, on your keychain, and on a burned CD; chances are that one of them will work, but you won't know which until you get there] CVS will work great for that part of your life which is project based, but you'll have to go back to option one to handle your MP3s and family photos. I think you're getting the picture here: the project you're working on these days should probably be in CVS, and then you can carry around a hard drive with all of the not-so-frequently-changing stuff that makes your home directory a home.
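For the curious, here is a rough sketch of the `archive your home directory into one file, minus the big stuff' idea from subsuboption one. It is not the one-liner from the original post; it's a stand-in written in Python with the standard tarfile module, and the function names, the toy directory layout, and the `music' exclusion are all made up for illustration.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_home(home, archive_path, exclude=()):
    """Write a gzip-compressed tar of `home`, skipping the top-level
    subdirectories named in `exclude` (e.g. the MP3 collection)."""
    home = Path(home).resolve()
    excluded = set(exclude)

    def keep(member):
        # Member names look like "home/music/song.mp3"; the component
        # right after the root is the top-level subdirectory.
        parts = Path(member.name).parts
        if len(parts) > 1 and parts[1] in excluded:
            return None  # returning None drops the entry (and its contents)
        return member

    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(home), arcname=home.name, filter=keep)
    return Path(archive_path)


def list_archive(archive_path):
    """Return the sorted names stored in the archive, for a sanity check."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


if __name__ == "__main__":
    # Build a throwaway "home" so the sketch runs without touching real data.
    with tempfile.TemporaryDirectory() as tmp:
        home = Path(tmp) / "home"
        (home / "novel").mkdir(parents=True)
        (home / "music").mkdir()
        (home / "novel" / "chapter1.txt").write_text("It was a dark and stormy night.")
        (home / "music" / "song.mp3").write_bytes(b"\x00" * 1024)

        archive = archive_home(home, Path(tmp) / "home.tgz", exclude=["music"])
        print(list_archive(archive))
```

Keep the archive file outside the directory being archived, or it will try to swallow itself. As with the batch-file advice above, the point of wrapping this in a function is that the exclusion list gets written down once and for all.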
|
16 March 05. Complementing your stats package with SQL
The basic principle behind Apophenia is that data should be kept in a database until it's needed, and then just enough pulled out for analysis. There are some things that are much easier to do in SQL, and there are some things that are much easier in a matrix-oriented programming language, and knowing what to do in which context can save hours.

[Figure One: Just meant for each other.]

Things which are easier in a database

¤ The mega-dataset, which asked every respondent eight thousand questions, of which you're going to use six. You need to read the whole file to get the data you need, but your computer doesn't have enough memory to hold all 800MB of data. So INSERT every line into the database, which will get written to disk, then select out the little bit you need. Your little laptop won't even break a sweat. By the way, don't forget: if you run regressions on all 8,000 variables, you're a bad person.

¤ Anything involving more than two dimensions. This is really what databases are designed for. You've got one data set which relates cholesterol levels to smoking, and another which relates smoking rates to frequency of getting laid, and you want to show the correlation between high cholesterol and multiple sexual partners. Such merging of data sets is a basic operation in database land, and a total pain in matrix terms. [select t1.*, t2.* from cholesterol_data t1, sex_data t2 where t1.smoking_rate = t2.smoking_rate. And yes, I know this is wholly spurious statistics; it's an attempt at humor.]

¤ Aggregation. Say you have a few observations of income (between one and ten, maybe) for every ZIP code, and you want the average per ZIP code. Again, a total pain via for loops through a matrix but one line in SQL: select zip, avg(income) from data_table group by zip. SQL is limited by not having any real aggregation functions outside of avg(), sum(), count(), min(), and max(), but that's all you'll need 90% of the time anyway.
[You can still do weighted sums by things like sum(income * weight).]

¤ Subsetting. Some languages (Matlab, Octave, R) do a good job of this, but others (Stata, I'm told) have no really easy way to pull subsets of the data a la select * from data_table where X*Y > .5. As you can see from the example, SQL is built to make this stuff trivial.

Things which are easier in a matrix-oriented program

¤ The actual math, the regressions and MLEs and such, are not gonna happen in the database, so after you've done all the data-shunting in the database, you'll have to pull it to a matrix for the final analysis. As above, the most math you can do in a database is basic arithmetic, and I haven't yet seen a db program which can be extended with user-defined functions. E.g., no reasonable query will attach a Gaussian-distributed random draw to each observation.

¤ Anything in which the data must be ordered, such as producing a CDF from a PDF. [But time series guys, you can lag in SQL: select (t1.income - t2.income) as diff from data_set t1, data_set t2 where t1.date = (t2.date - 1)]

¤ Real live matrices (as opposed to data sets). SQL is an algebra on tables, but its concept of the product is pretty drastically different from the matrix algebra product. Taking the transpose of a database table just makes no sense.

Summary: database methods are not a panacea for anything, but are an excellent complement to matrix-oriented programming languages, because things which are difficult in one are often easy in the other. If you know what is easier in which, you can save a whole lot of your life not having to write little procedures.

Policy implications: read up on SQL (many a tutorial out there). If your favorite stats package doesn't handle database operations, you've got two choices: dump it and get to know something like Apophenia, or get a standalone database program and use both.
Write the data-massaging half of your scripts in the database program, then write the analysis in the matrix-handling program. If the two programs are worth anything, they should have command-line and text file reading/writing utilities to facilitate this.
|
22 July 05. Take my program---please!
You're sitting there thinking, `gee, I could use a program to do a certain nifty trick.' You go online, ask your favorite search engine for nifty-trick programs, and get a list of two hundred. You click on a few, watch them fill your hard drive with crap that you can't identify, and click a variety of setup programs which spew garbage all over a still wider range of hard drive. You find one you like, but then it starts popping up `buy my shareware' reminders. Eventually, the whole system crashes, and then you have to dig up lots of CDs to reinstall your office package, your stats package, your graphics package, &c. Six months later, all the cool kids have switched to new software, and you have to start all over to upgrade.

The systems

At this point, those readers who don't run a system with a package manager are hopefully thinking, `gee, I'm missing out.' So, here are some package systems which you may want to consider.

Windows people

Cygwin, which gives you a UNIX subsystem, including the X Window system, and loads of packages. The setup is package-based; just re-run the thing to pull more packages. I'm pretty sure the packages are basically RPMs (see below); you can also install some RPMs directly.

McPeople

Fink, which is a set of ports from Debian. Since all of OS X is a UNIX system, it is reasonably seamless to interoperate with your other OS X programs.

Linux

All Linux distributions run a package manager of some sort, so it's just a question of which. The two competing styles are the Debian APT packages, originally set up by Deb and Ian at Purdue University (I wonder if they're still together), and the Red Hat Package Manager (RPM) system. Those in the know tell us that the APT system is better written; my own experience to date with APT has been better as well. I'm using Ubuntu's distribution right now. But RPMs have improved over the years, and may work fine for you.

Another option, which I'm running on my badass number-cruncher, is Gentoo's portage.
It downloads source code, and uses GNU's autoconf system to work out how to compile the program from source on your computer. Very cool and very automated. Unfortunately, the installation system for Gentoo itself is currently underautomated, so it's only useful for things like dedicated badass number-crunching machines. [The GNU autoconf system, by the way, is an absolute revolution in how the world of free computing works. The likelihood that you download something and it compiles and works the first time is exponentially higher if it's built using the autoconf system, and that's what allows everybody to just set a few switches and distribute different versions of the same program for so many systems.]

Why they work

There are a few reasons why this would only work in a unixy system and with free software.

There's the idea of the library, which all computing systems have. [Windows calls it a DLL: dynamically linked library. Microsoft employees refer to the problem of getting multiple DLLs to play nice together as `DLL Hell'.] The library is a bunch of functions to do something like paint pixels on a screen or read MP3s or calculate Fourier transforms. A full-blown program just gathers together lots of library functions and runs them in sequence. [Really, a program is itself just a function library, with a function named main which will auto-execute.] The Windows OS and McOS before version X never got the library thing right, for reasons I won't bore you with now. But on a unixy system, multiple versions of a library, and hundreds of 'em, can live together in harmony. When somebody wants to write a nifty program, they don't have to start from scratch, because they have the libraries at hand---which is a key difference from not-free software, where the WordPerfect people and Microsoft Word people wrote a host of identical libraries.
Further, the communal libraries get better with time, since the author of each new program likely contributes a function or two to the library itself. The package concept and the library concept correspond. For those who don't know the system, that's why your package manager insists on first installing a billion things named libwhtvr-0.4 when you try to install a single program. There are few ways to calculate a Fourier transform, but there will always be differences in æsthetic tastes, so there are often many clicky front-end programs that call these libraries. They'll all coexist nicely, because unixy systems (mostly) get the sharing of libraries right. In short, there's competition where it matters, and cooperation where competing would be a waste of time. There are a million papers about how this mix of cooperation and competition is delightfully interesting as an economic phenomenon, but those papers are generally not that great, and you don't care---all that matters is that the system has led to a unified scheme to let you load up your laptop effortlessly. [Notice that stats packages don't follow this library-and-front-end standard. There are a hundred out there, and most of them write their own computational routines, and all of them write their own stats routines. Which is why the world needs Apophenia.]

las quejas

There are a few problems with the package system. Instead of getting your search engine's list of a thousand competing titles of varying quality, you often get only one or two (which call the same libraries). The GNU postscript rendering library messes up landscape views, and will do the same if your front end is ghostview or the GNOME PDF viewer. The GIMP is the only full-scale image manipulation program in the free and packaged world, and if you don't like it, it's tough cookies for you.
The converse is that if you do, a lot of people have all put their energies into it, and it gets better literally every day.

The package system is generally patriarchal, in that you're depending on the people who packaged the stuff to get around to it; e.g., I've been waiting for Gentoo's people to package GCC-4.0 for months now. Building stuff outside of the package manager is OK, but often causes problems, since dependencies get messed up, blah blah. [I couldn't do it with GCC-4.] A full upgrade when the base libraries like libstdc and the kernel change significantly is spotty---in my own experience, RPMs are the worst with this.

[And autoconf is a pain to work with on the developer side. Having attempted to autoconfiscate my own stuff, I found the documentation hard to work through, and if your program worked with autoconf 1.7, you may have to rewrite all of your scripts to get them to work with autoconf 1.9. The price we pay for magic.]

So, there's the joy of the package system. Try it out.
|
24 July 05. How to free yourself of patriarchal fetters
A few grad students have asked me to send them some notes on making the transition from a higher-level language like Matlab to C. So, here it is, for the public record. There's some overlap between this and my first post on C, but Ms EB of Washington, Columbia, an Ashtanga Yoga teacher, suggests that I provide more repetition, so here goes.

I'm often surprised by how little time I spend talking about C, given that it's what I spend most of my working hours with. I'm not gonna talk about whether C is right for you, blah blah blah. If you're doing a large-scale simulation, do it in C. For organizing your porn collection, there are other languages.

The best way to explain how one would transition from a high-level language like Matlab to C is to explain how C is different. The key differences are in the management of memory and in the tools which support C; beyond that there are loads of boring little details. Let me begin with the memory management issue, because it motivates why you might want to learn C to begin with.

The pointers

The thing that _really_ distinguishes C from virtually every high-level language is that C gives you two means of dealing with memory, which you can think of as automatic mode and manual mode.

In manual mode, you declare that some piece of memory somewhere will hold your variable, and then you get a pointer to that piece of memory. Think of a little garden plot in the middle of nowhere. You don't really care to visit it yourself, but you have a robot assistant you can send out to tend the plot for you.
When you send it out to add fertilizer, you'll need to make absolutely certain that you're telling it to dump it in the right plot of land (because outside of the garden, we don't call it fertilizer); if you lose the slip of paper where you wrote down directions to the garden, you can kiss your tomatoes goodbye; if you accidentally tell your robot assistant to plant cucumbers in exactly the same spot, then it'll trample the tomatoes and instead of a tomato plot plus a cucumber plot, you'll just have a cucumber plot and salsa residue. There are a lot of things that could go wrong, so the stats packages help you deal with manual memory allocation by not letting you do it. 100% of memory handling is automatic, and now you can't hurt yourself with those sharp pointers. They're probably right: if I'm a student who just wants to do his matrix algebra homework, I could care less about memory management. However, if you're doing a large-scale simulation, manual memory management becomes essential. Let's say you write a function to shuck corn. With the automatic memory allocation paradigm, you have a big pile of corn at your feet. You know exactly where it is, so it's easy to call the corn_shuck function, by hiring a truck to send your pile of corn to the function. Conversely, under the manual memory management paradigm, all you have is an address, so you send the function a note saying `please go to my garden at 0xbb88d345 and shuck whatever corn you find there.' You could get the address wrong or forget to plant the corn, but if you do it right, sending a note with an address is a much more efficient process than sending a truck. OK, enough with the metaphor: when you call a function with automatically allocated memory, you send in a _copy_ of the data, which may entail some significant labor in making that copy; when you call a function with manually allocated memory, you just send in a copy of the address, which just requires copying a single number. 
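The truck-versus-note distinction can be sketched in a few lines of C. This is only an illustration under stated assumptions: corn_pile, shuck_by_copy, shuck_by_address, has_corn, and corn_demo are all made-up names, and the pile sits in an ordinary local variable for brevity (the same reasoning applies to memory you got from malloc).

```c
#include <assert.h>
#include <stddef.h>

#define N_EARS 1000

/* Hypothetical pile of corn: big enough that copying it is real work. */
typedef struct {
    double ears[N_EARS];
} corn_pile;

/* The truck: the caller's whole struct is copied in.
   Changes here touch only the copy and vanish on return. */
void shuck_by_copy(corn_pile pile) {
    for (size_t i = 0; i < N_EARS; i++)
        pile.ears[i] = 0;
}

/* The note with an address: only the pointer is copied.
   Changes here land in the caller's data. */
void shuck_by_address(corn_pile *pile) {
    for (size_t i = 0; i < N_EARS; i++)
        pile->ears[i] = 0;
}

/* Returns 1 if any ear is still unshucked (nonzero), 0 otherwise. */
int has_corn(const corn_pile *pile) {
    for (size_t i = 0; i < N_EARS; i++)
        if (pile->ears[i] != 0) return 1;
    return 0;
}

/* Walk through both calls; the asserts record what each one leaves behind. */
void corn_demo(void) {
    corn_pile garden;
    for (size_t i = 0; i < N_EARS; i++)
        garden.ears[i] = 1;

    shuck_by_copy(garden);          /* copies all N_EARS doubles */
    assert(has_corn(&garden));      /* the original is untouched */

    shuck_by_address(&garden);      /* copies one address */
    assert(!has_corn(&garden));     /* the original was modified */

    /* The struct is far bigger than the slip of paper pointing to it. */
    assert(sizeof(corn_pile) > sizeof(corn_pile *));
}
```

Call corn_demo() from your own main to try it. The by-copy version is also the safer one, since it cannot trample the caller's tomatoes; the by-address version is the cheap one.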
I have some functions which are called literally billions of times over the course of the simulation over on the other screen; in a system which insists on copying in a structure every time, my two-hour run would take weeks to finish.

Pointers are cognitively difficult, meaning that unlike the semi-memorization involved in learning syntax details, you will have to learn. You will confuse yourself repeatedly about whether you are referring to an array element or the array itself or a pointer to the array. The syntax won't help, because the declaration of a variable and its use both involve stars, but in different, incompatible ways. The compiler will help, because a char* and a char are different types, so the compiler will yell at you when you screw up.

Manual memory errors are confusing because when the robot servant brings back cucumbers instead of tomatoes on line 3000, it could be because you had sent it out to plant cucumbers on line 2000; stare at line 3000 all you want, but you won't find your bug. Because tending manual memory is known to be difficult, there are loads of utilities to help you along. The debugger will be invaluable. If you're lucky, valgrind will work for you. There's electric fence, by a certain open source talking head.

The workflow

Matlab is an integrated development environment (an IDE), meaning that you never leave the Matlab window to do anything---and golly, you can't, since nothing but Matlab (and Octave) can understand Matlab code. On the other hand, C has been a standard for a few decades now, and that means that there are an absolutely overwhelming number of programs that will help you write C code.

You'll need a good text editor in which to write the code. There are many with useful features for writing code like color syntax highlighting; at the least, you need one that will let you jump to line 267 quickly. Then there's the compiler, which translates your text into something the computer can run.
Next---and this is not optional---is the debugger, which you'll be spending a lot of time in. There are IDEs for C which will gather all this together for you. I've never dealt with them, so I can't advise you. Ask Google. Oh, and see also the ddd front-end for gdb.

The typical routine goes like this: you have the text of your program in one window, the compiler in another, and the debugger in a third (and documentation in your browser). You write a few lines of text, then compile. The compiler yells at you that you've gotten some details wrong on line 267; you go there and end your paren or add your semicolon. When the compiler is happy, you run it under the debugger, at which point your program will run briefly and then crash. You get a trace at the point where it halted, and then go back to your code and fix your error. Repeat until it runs. If the answer looks completely wrong, add breakpoints in the debugger so you can observe your intermediate results. Keep running and modifying until it looks right.

This probably sounds a lot like your current workflow, except it's much more decentralized. You'll also be spending more time in the debugger. Life is just too short to not be using a debugger all the time. If something breaks, eyeball your code for a few seconds, and if it ain't obvious, go straight to the debugger. If your current favorite language doesn't have one (you may need to search the `nets for it), ditch the language.

The libraries

Matlab has a whole lot of functions built in, either to the syntax itself (like matrix multiplication) or as functions they wrote for you (the .m files). C has next-to-nothing built in to the language, and therefore depends heavily on external functions.

You're given the standard library by default; here's the standard library documentation. Beyond that, there's the additional stuff for the work specific to what you're doing. The range available is stunning.
You will almost certainly want to pick up the GNU Scientific Library; I've written a library on top of it to facilitate some things that I found in need of facilitation, in which you may or may not be interested. This is where make also comes in handy, because you now need to link together your program with these external libraries, and the linking commands get long; make will help you organize them. This will involve some frustration as you try to get the link flags right in your Makefile, but once you're set up, you'll never think about it again, and this host of useful functions will work seamlessly into the rest of your work.

Syntactic sugar

Matlab and Co. do a lot behind your back, which is why they are easy to start with and which is why they are slower in the long run. So you will now need to do a lot of boring coding yourself. Notably, you need to declare your variables beforehand. People whine about this all the time, but it's just a piece of tædium that you'll get used to. The compiler will warn you when you haven't declared something.

C just has no glamour to it. It requires that you cross all your own ts and has no specialized i-dotting syntax to make the code look more like an equation. There's no cute syntax to operate on every element of a vector at once---to operate on a matrix, you'll need to step through every element yourself. This is boring and inelegant. Deal with it. Some people go further and imply that languages which deal with vectors directly are somehow technologically superior, but this is just dumb. These languages may have a few tricks at hand, but in the end they have to iterate item by item for you. Some processors like the PowerPC have a vector type, but unless you're using a vector-based processor, you're stepping through item by item no matter what the syntax looks like.

Learning C

OK, so this should give you a good idea of what you need to focus on to transition from more patriarchal languages to C.
You'll first need to get the assorted parts: gcc (the compiler), gdb (the debugger), make, the GSL, maybe valgrind. Lucky I gave a summary last time about how to use package managers to get heaps of software at once.

Once your toolchain is in place, you'll need to get to know the tedious syntax, about declaring things and writing for loops correctly. This is easy; just a question of developing some habits. Type 'C tutorial' into your favorite search engine and start reading. I also have a tutorial, which is more conceptual and less nuts-and-bolts.

Part of the process is getting to know the linkers and the debugger. This means reading more manuals, but nothing really difficult. Most high-level languages include a debugger, and all debuggers have the same set of concepts, so you can futz around with one in your familiar setting before dealing with gdb. The linker thing is totally absent from most packaged languages, and will annoy you until you learn the difference between -l and -L (-l names a library to link against; -L adds a directory to the list of places the linker looks for libraries).

Finally, when everything compiles and the environment feels reasonable, you should spend a few hours making a concerted effort to get down the pointer thing. Again, the search engines are there for you, and there are more links from my first post on C, linked at the top of this column.

You can see that there's a lot of front-work before you can get stuff done---especially now that the typical computer lab computer doesn't have a C toolchain installed by default. Leaving behind the fetters of the packaged language requires taking on responsibilities you'd rather not care about, but for computationally intensive work, the freedom is worth it.
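To make the `Syntactic sugar' point concrete, here is a rough sketch of what stepping through every element yourself looks like. The function names (vector_add, matrix_times_vector, loop_demo) are invented for this example, and the matrix is assumed to be stored row by row in one flat array; a real project would more likely lean on the GSL's vector and matrix types.

```c
#include <assert.h>
#include <stddef.h>

/* out[i] = a[i] + b[i]: what a matrix language writes as out = a + b. */
void vector_add(const double *a, const double *b, double *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* out = M * v, for a rows-by-cols matrix stored row by row in a flat array. */
void matrix_times_vector(const double *m, const double *v, double *out,
                         size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        double sum = 0;                     /* declared before use */
        for (size_t j = 0; j < cols; j++)
            sum += m[i * cols + j] * v[j];
        out[i] = sum;
    }
}

/* A small worked case; the asserts state the expected results. */
void loop_demo(void) {
    double a[3] = {1, 2, 3}, b[3] = {10, 20, 30}, sum[3];
    vector_add(a, b, sum, 3);
    assert(sum[0] == 11 && sum[1] == 22 && sum[2] == 33);

    double m[2 * 3] = {1, 0, 2,
                       0, 1, 0};
    double mv[2];
    matrix_times_vector(m, a, mv, 2, 3);    /* rows: 1*1 + 2*3, then 1*2 */
    assert(mv[0] == 7 && mv[1] == 2);
}
```

Every variable has a declared type and every loop is spelled out, which is exactly the tedium described above; call loop_demo() from your own main to run it.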
|
02 January 06. Object-oriented programming in C
Here are notes on object-oriented programming (OOP) in C, aimed at people who are OK with C but are primarily versed in other fancier languages.

What is this object-orientation?

The OO framework is in some ways just a question of philosophy and perspective: you've got these blobs that have characteristics and abilities, and once you've described those blobs in sufficient detail, you can just set them off to go running with a minimum of procedural code. If you want to be strict about it, the objects only communicate by passing messages to each other. All of this is language-independent, unless you have a serious and firm belief in the Sapir-Whorf hypothesis.

Much of object-oriented coding is distinguished via a method of scoping. Scope indicates what, out of the thousands of lines of code and dozens of objects you've written down, is allowed to know about a variable. The rule of thumb for sane code-writing is that you should keep the scope of a variable as small as possible to get the job done. Think of a function as a little black box: you want it to have as few moving parts as possible and to run independently of the outside world. From the OOP perspective, this translates into dividing variables into private variables that are only internal to the object, such as the internal state of the car's motor, and things that the whole world can use, such as the location of the car. Thus, every OO language I can think of defines public and private keywords.

But wait, there's more: sometimes, you really have to break the rules, just this once, and check the internal status of the motor. You can make the status variable global, defeating the whole mechanism, or you can define a friend function. Below, we'll have inheritance, and will also need protected scope. Sometimes, the :: operator will get you out of a jam.
That is, we can divide the OOP additions to C's syntax into two parts: syntax to give you stricter, finer control over scope, and syntax to override those stricter controls.

How does C do it?

The scoping rules for C are defined by the file. A variable in a function is visible only to the function; a variable outside the functions, at the top of a file, is visible only in that file.

A typical file.c will have an accompanying file.h that simply declares variables and functions. If another file includes file.h, then that file can see those variables and functions.

As for the naming, objects let you name functions things like move and add and never worry about interfering with fifty other functions with the same names. This is nice, but there is a simple C custom to take care of that: prepend the object name. Instead of the C++ my_data.move(), where you just understand that this move function refers to an apop_data object, you'd have a function called like apop_data_move(my_data). There ya go, crisis averted: no name space clashes. Some readers somewhere may complain that the name-prepending is ugly, or that such naming requires programmer discipline without compiler checks, to which I respond: if you're worried about these things you're even more boring than I am. But seriously, go have a look at Joel the guru for more on how wonderful naming similar to this can be.

So C already has a scoping system that matches C++ if you use the one file-one object rule and a few customs in naming. Thus, adding a whole new syntax for scoping on top of this is basically extraneous, and could create confusion now that you've got two simultaneous scoping systems in action.

Inheritance and overloading

Overloading functions and operators is dumb. Joel's article above has a humorous bit about this, which opens: "When you see the code i = j * 5; in C you know, at least, that j is being multiplied by five and the results stored in i.
But if you see that same snippet of code in C++, you don't know anything. Nothing." The problem is that you don't know what * means until you look up the type for j, look through the inheritance tree for j's type to determine which version of * you mean, et cetera.

Say you have a blob object, and think, faux pas, that it is a blobito object. You call the my_blob.cleave() function. In C, you would be notified of your error at compile time (you'd be calling blobito_cleave(my_blob) when you should be calling blob_cleave(my_blob)). In many interpreted languages, you would be notified of your error at run time, or sooner depending on the language. In C++, with appropriately defined methods, you would never, ever be notified of your error. That is, operator overloading allows you to bypass a large number of safety checks. It's probably the case that blob::cleave and blobito::cleave do somewhat similar things, so the effect on the output may be wrong in subtle ways.

Option B is inheritance via composition. For example, Apophenia has an apop_data type:

typedef struct apop_data{

In OOP-speak, the apop_data structure is a multiple-inheritance child of the gsl_matrix and apop_name structures (plus an array of strings). All of the functions that operate on these parent objects can act on elements of the child apop_data structure, and life is good. On the one hand, this means that if a function acts on a gsl_matrix * you can't transparently call, e.g., apop_sv_decomposition(apop_data_set); you have to know that there's a gsl_matrix inside the data set and that's what's being operated on: apop_sv_decomposition(apop_data_set->data). On the other hand, you cannot accidentally call the wrong instance of the function and then spend an hour wondering why the function didn't operate the way you'd expected. On the minus side, the internals of the object aren't hidden from you---but on the plus side, things aren't hidden from you.
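Here is a toy version of inheritance via composition, combined with the name-prepending custom. To be clear about assumptions: the matrix, name_list, and data_set types below are stand-ins invented for this sketch, not Apophenia's or the GSL's actual definitions, and matrix_sum and composition_demo are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "parent" types, standing in for things like a matrix
   structure and a names structure. */
typedef struct {
    size_t rows, cols;
    double *values;         /* rows*cols entries, stored row by row */
} matrix;

typedef struct {
    const char *title;
} name_list;

/* The "child" holds its parents as members: inheritance via composition. */
typedef struct {
    matrix *data;
    name_list *names;
} data_set;

/* A function written for the parent type; the matrix_ prefix says which
   type it expects, so there is nothing to overload and nothing to guess. */
double matrix_sum(const matrix *m) {
    double total = 0;
    for (size_t i = 0; i < m->rows * m->cols; i++)
        total += m->values[i];
    return total;
}

void composition_demo(void) {
    double values[4] = {1, 2, 3, 4};
    matrix m = {2, 2, values};
    name_list n = {"toy data"};
    data_set set = {&m, &n};

    /* matrix_sum(&set) would make the compiler complain about the pointer
       type, so you reach inside the child and hand over the parent. */
    assert(matrix_sum(set.data) == 10);
    assert(set.names->title[0] == 't');
}
```

The cost is the extra `.data' (or `->data') at each call; the payoff is that the compiler checks which function you are calling on which type. Call composition_demo() from your own main to run it.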
the void, templates

And finally, for when you really don't want to deal with types, there's the void pointer. Here's a snippet from Apophenia's apop_model type:

typedef struct apop_model{

Two things to note from this example. First, including a function inside a struct is a-OK. We'll declare something like: apop_model apop_GLS = {"GLS", apop_estimate_GLS, ...}; and then we can call apop_GLS.estimate(data, NULL, sigma); just like we would in C++-land.

Second, there's the void pointer at the end of the declaration of the estimate method. Here's the declaration for apop_estimate_GLS: static apop_estimate * apop_estimate_GLS(apop_data *set, apop_inventory *uses, gsl_matrix *sigma). Notice that that third argument is typed as a gsl_matrix *, even though we're plugging it in where the template asked for a void *. Non-OOP quiz question for the statisticians: why is this a terrible way to implement GLS? Other models require different parameters, like the MLE functions take parameters for the search algorithm, but they're also called via the same model_instance.estimate(data, NULL, params); form. Notice also how the function pointed to is declared to be static, so outside the file it's only accessible as the object.estimate() method.

In short, the void pointer is your way of saying "Dear C type-checker: leave me alone." The type-checker will still check that you're sending a pointer and not data, but from there you're free to live it up. I can't recall ever using this, but if you wanted to, you could even type-cast inside the function:

void move(void *in, char type){

The void pointer is how you would implement template-like behavior. For example, here is a linked list library (gzipped source) that I wrote when I was avoiding harder work. It links together void pointers, meaning that your list can be a linked list of integers, strings, or objects of any type. How's that for a nice, concrete example?
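A stripped-down sketch of the function-in-a-struct plus void pointer pattern might look like the following. Everything in it is hypothetical: the model type, the trim_params struct, and the two estimators are invented for illustration and are not Apophenia's definitions. One deliberate difference from the description above: here each estimator is declared with a void * argument and converts it inside the function, which keeps the function's type identical to the slot in the struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model type: a name plus an estimate method whose last
   argument is a void pointer, so each model can take its own parameters. */
typedef struct {
    const char *name;
    double (*estimate)(const double *data, size_t n, void *params);
} model;

/* Parameters that only the trimmed mean cares about. */
typedef struct {
    size_t skip;        /* number of leading observations to ignore */
} trim_params;

/* static: outside this file, reachable only through the structs below.
   Both estimators assume n > 0 (and n > skip) for brevity. */
static double estimate_mean(const double *data, size_t n, void *params) {
    (void)params;       /* this model takes no extra parameters */
    double sum = 0;
    for (size_t i = 0; i < n; i++) sum += data[i];
    return sum / n;
}

static double estimate_trimmed_mean(const double *data, size_t n, void *params) {
    trim_params *p = params;    /* the unchecked step: we trust the caller */
    double sum = 0;
    for (size_t i = p->skip; i < n; i++) sum += data[i];
    return sum / (n - p->skip);
}

model mean_model = {"mean", estimate_mean};
model trimmed_model = {"trimmed mean", estimate_trimmed_mean};

void model_demo(void) {
    double data[4] = {100, 2, 4, 6};
    trim_params p = {1};
    /* Same calling form for both models; only the params differ. */
    assert(mean_model.estimate(data, 4, NULL) == 28);
    assert(trimmed_model.estimate(data, 4, &p) == 4);
}
```

Nothing stops a caller from handing trimmed_model a pointer to the wrong kind of struct; that is the price of telling the type-checker to leave you alone. Call model_demo() from your own main to run it.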
This self

I've only wanted something like the this or self keyword maybe twice, but I have no idea how to gracefully implement it in C, if at all. [Maybe with the preprocessor?] So I'm open to suggestions on this one.

Refs

More essays along the same lines:

A full book that goes into great detail about the above simple tricks, and also goes much further in implementing something that looks like C++.

An article that focuses on encapsulation, with some suggestions on hiding data.

Another article that blew way past my attention span, and basically shows you how to write a C++ compiler in C. Given my disdain for overloading and strict inheritance (as opposed to inheritance via composition), I wasn't really into it.

OK, there you have it: most of the basics of object-oriented programming implemented via relatively simple techniques in C. The moral: object-oriented coding is a method and a mindset, not a set of keywords.
|
08 January 06. Why word is a terrible program
I was writing a blog entry about proselytizing, which I dislike and may not post, but one point that came up is that the only thing that I actively proselytize, the only thing that I really want other people to do differently, is to use a semantically-oriented document preparation system.
|
06 May 06. The schism
Those of you who actually read my posts about efficient computing, rather than just going to read the comics at the first sight of the word `computing', may by now have noticed a few patterns. The most basic is that standards are important. I know this sounds obvious to you, but if it's so obvious, why do people get it wrong so darn often? Why are people constantly modifying and violating standards that work just fine?

I know many of you have suspected this for a while, but let me state it loud and clear: I am conservative. Rabidly conservative. I think that people need to have a really good reason for not conforming to technical standards, and I think most people don't--they just use the shiniest thing available. A large amount of my writing on technical matters is simply pointing out that well-thought-out technical standards tend to work better than the newest and shiniest, and that the value of stability often more than makes up for any flaws in the standards. Even my work on patents is aimed at making sure that open standards remain open and free to implement.

I originally tried to make this into an essay about both computing standards and general customs, but over the course of writing it, I came to realize that the two are fundamentally different. If somebody doesn't quite conform to your human customs--if they use the wrong fork or speak non-native English or wear ratty t-shirts to the office--then the person will be funny or diverse or annoying or just normal. Meanwhile, if computing standards aren't followed--if somebody gets sick of C's array notation, array[i][j], and decides it looks nicer as array[i, j]--then their writing is 100% gibberish and they might as well be speaking Hindi to an English-speaker. Standards-breaking in social settings can be fun; standards-breaking in computing is just breaking things.
So although I usually try to put something in the technical essays that will be interesting to those who couldn't care less about machinery, I don't think any of the below is truly applicable to social norms. Or you can read on and decide for yourself. Nor is this a comprehensive essay on standards drift and revolution, because that would take a volume or two. Just file this one as assorted notes on one question with an interesting proposed solution: what to do with all those people who keep trying to revise and update and modify the standards?
Schisms
Intuitively, there's the English-teacher approach, where we force everybody to stay in line with the basic standard. When you go home to write your pals, your English teacher instructed you, be sure to use perfect grammar at all times.
But another approach is to let the whippersnappers fork. On the face of it, it may seem contradictory to think that splitting a standard in half would somehow make it purer, but under the right conditions, it can be the best approach. For any technological realm, you've got one set of people who just want features--lots and lots of features, enough to wallow in like they're a bed of slightly moist hundred dollar bills--and you've got another team that wants fewer moving parts, and takes care to maintain discipline and stick to the existing norms. We can bind the two teams together, in which case they will constantly be fighting over little modifications to the system and neither team will be happy. That's what happens with English. Or you can have the schism. Allow me to cut and paste from Amazon:
The C Programming Language
by Brian W. Kernighan, Dennis M. Ritchie
First edition
228pp, 1978:
The C++ Programming Language
by Bjarne Stroustrup
First edition
Amazon.com Sales Rank, paperback: #1,243,918
Things we conclude: C++ is much more complex than C--274pp v 911pp. C++ keeps evolving: from 1986 to 2000, the book has had three editions, over which it has tripled in size. People are still buying the 1978 edition of K&R C because it's still correct; the first edition of Stroustrup is so incompatible with current C++ that people can't give it away. Finally, Prentice-Hall really needs to lower the price on the hardcover edition of K&R. I mean, my book is selling better than their hardcover, which ain't right.
Meanwhile, C is as stable as can be. Cyndi Lauper has put out seven albums since K&R C came out. The changes from the first to the second edition of K&R are pretty small--literally, they're a fine-print appendix. And, I contend here, C owes its immense stability to Bjarne Stroustrup. With Bjarne putting out a new version of C++ every few years that frolics along with still more features, Prentice-Hall is free to reprint the same version of the C book without people whinging about how it's missing discussion of mutable virtual object templates. The guys who want simplicity and stability buy K&R and the guys who want niftiness and fun features buy Stroustrup and everybody's happy.
The other technical standard I use heavily is TeX, and I'd been meaning, for the sake of full disclosure, to give a critique of TeX comparable to this here critique of Word. Fortunately, Mr. Nelson Beebe already did it for me, in this (PDF) essay entitled 25 Years of TeX and Metafont. The article alludes to exactly the same sort of schism in typesetting as in general programming: you've got the people who are totally ignorant of standards and just want the shiniest new thing, and the people who built a standard system that has been stable for the better part of 25 years. Since he's on the standards-oriented team, he gives many examples of how such stability has led to large-scale projects that have significantly helped humanity.
His discussion of its limitations is interesting because there really are features that need to be added to TeX--notably, better support for non-European languages and easier extensibility. But "TeX is quite possibly the most stable and reliable software product of any substantial complexity that has ever been written by a human programmer." (p 15) Changing a code base that hasn't seen a bug in fifteen years is not to be taken lightly, so the process raises interesting questions.
Evolution
So when you read about the raging debate between Blu-ray and HD DVD (I'm rooting for the one that isn't an acronym), don't think `oh, now I have to worry about all my stuff being obsolete'. Thank those guys for distracting attention from DVD, which is a nice, stable format that hasn't changed in a decade, ensuring that your stuff has not become obsolete. People have made haphazard attempts to revise the CD format, but thanks to distractions like the MiniDisc and even DVD, your copy of Cyndi Lauper's first album is still the cutting-edge CD standard (specified in The Red Book, 1980), while attempts to subvert the CD standard never took off. Remember CD+G? If so, you're the only one.
So how do conservatives evolve? Are we trapped in using standards from the 70s forever more? Of course not. But the evolution is not from clean standards to floundering in pits of features, but revolutionary breaks from old clean standards to new clean standards. The feature pits are just distractions.
The process of evolution via incremental fixes to follow the trends
has an unimpressive track record. Corporate-sponsored standards often
suffer this failing (but not always), because setting standards that
last for two decades and selling frequent updates are hard to reconcile.
One company spent a while there naming its document standards with a
year--standard '98, standard 2000, et cetera--which in my
book means none of the formats are actually standard. The right way
is to ride a system until it really doesn't do what you need anymore,
and then revolt, building a new one that is clearly distinguished from
the old, as we saw with DVD's overthrow of CD because CDs truly can not
store movies, or Ω
The trick is to know when to revolt. When is a new trick so valuable that
the old system should be abandoned? Many a dissertation has been written
on this one, and I ain't gonna answer it here. But for well-thought-out
technical standards, it's much later than you think, as demonstrated by
the active 25-year-old standards above.
|
18 October 06. How to pick a computing language
I can not stand how much debate there is about computing languages. I hate the fact that the Web is filled with it, I hate the fact that so many postings on Usenet along the lines of `I need a Matlab routine to...' get replies like `Why aren't you using Algol!?', and I hate that my own work is so often evaluated based on choice of computing platform rather than actual output.
So, in my little effort for world peace, here are my notes on picking a language. I'll generalize this a bit next time, but the basic theme is that there is no One True Way. Picking a language means picking the least annoying option in each of a long series of trade-offs.
The moral here is that, even though you will no doubt have a preference on one side or the other with all of the debates below, there is indeed a sensible other side that other people prefer. That is, there are languages on both sides of any debate because there are valid reasons, both in terms of æsthetic preference and in terms of practical issues, for picking both options. Anybody who tells you otherwise, for example insisting that we must all use dynamically-typed languages from now on, is just being an ass. Pick the least annoying paradigm for yourself, and let your neighbors pick the least annoying paradigm for themselves. We'll all get our work done in the end.
Having ranted, here's a list of a dozen primary axes along which general-purpose computing languages differ. Work out which side you prefer, find a language that is on the same side as you, and go code.
1. Are the libraries what you need?
The primary joy in using an existing computing language is the strong hope that somebody has already written the code you need. However, I have never seen a language that really has a good library for everything I've wanted. I see the big schism as languages with lots of libraries for numerical and other number-crunching routines versus languages with lots of support for Web handling, though you will no doubt find your own divisions among what type of code base supports what languages.
2. Does it assign types dynamically or statically?
The evangelists here all seem to be on the dynamic-typing side, but the issue is more muddled than a one-side-or-the-other split, and neither extreme is great.
If a system is |