Patterns in static

Some fluff, some info

02 December 03. RTFM [Read the manual]

In which the author bitches, and then gives practical advice that will save (segments of) the reader's life.

OK, so here's some quick math that I worked out on a little spreadsheet. Say there's some little routine that takes you five minutes a day, and you could eliminate it with an hour of research into correctly automating the thing. Then over the course of a work-year, you'd save two and a half days. In the words of the great Margaret Cho, you could take a pottery class. So some research up front can pay for itself many times over; this is nothing your momma didn't already tell you.

But when people sit in front of a computer---the paragon of automatability---they instead want something they can use immediately. A good program, we read, does exactly what the user thinks it should from the start (i.e., is intuitive). It should require no learning of new methods, and should instead analogize to real-world actions like pointing to things (i.e., clicking on pictures).

So if I've set up this essay correctly, it should be completely obvious that the first and second paragraphs there are in direct contradiction. From the perspective of efficiency, intuitive design is not helpful. On occasion we're lucky and the most intuitive and most effective method match, but in most cases we need to learn some new little method to implement the most effective route.

I am flabbergasted, horrified, and in despair over the resistance people have to learning how to use software. People will read the entire frigging Chicago Manual of Style rather than learn to use new software (see below). They're using this one program because in 1997, when they were afraid of computers, they could pick up that program and use it in the first hour---but it's been six years and they've used that program for a thousand hours a year, and they're no more efficient at clicking the little icons, and they spend more and more time trying to do increasingly complex tasks until they realize that there's no intuitive way to efficiently do the complex things they have to do.

Oh, the mortal cost of that first hour! What a seductive compact with the devil were those icons! The little sound effects were so pleasing, yet the software proved to be a siren which drains away all life, one pleasing mouse-click at a time.

We should face facts: we're in front of computers, all day, every day. That doesn't mean we like them, but if we like what they do for us, then it's really worth an initial effort to work out what's good software---not in terms of what matches your immediate intuition, but what you can learn to use efficiently as you use it every frigging day until you finally reach the ever-growing retirement age and can finally just stay home and surf the Web.

Why I care One: I have to give tech support to people all day long. The fewer people using cute but annoying software, the less tense I'll be. Two: I want better software that I can use. All that effort that goes into making better eye candy for Word is wasted for me. Three: ``No [person] is an island,'' wrote John Donne, and it truly, physically pains me to hear the travails of somebody who wrote their dissertation in Word, which you'll recall the New Yorker describing as a terrible program.

Things you can do Here are some things that are not intuitive, require reading the manual, and will save you huge tracts of time. Listed in order of commitment. You don't have to read the whole manual for any of this; you don't have to memorize anything (just leave the manual open for reference somewhere and you'll remember the important stuff soon enough); you just have to learn enough to understand the basic framework and to know where to look for more info. Geeks will notice a thematic relation between this and the essay of 10 November, q.v.

Stop using the mouse. Most of what you can do with the mouse you can do with the keyboard, such as switching applications (Alt+Tab) or dealing with the menus (Alt+underlined letter). Not taking your hands off the keyboard is the sort of thing that will save you 1.2 seconds two hundred times a day, for a total of one free coffee break. People used to stand behind me when I was working with multiple applications, amazed at how quickly I could do stuff without using the mouse. They seemed to think it was magic. I thought it was funny.

[And you have learned to touch-type, no? Me, I learned by putting a t-shirt over my hands so I couldn't subconsciously peek.]

Use style sheets. Within your word processor (e.g., Open Office), there's a feature that gives you a list of header types. Instead of specifying `boldface, larger, new font' for every header, you can instead select `header 3' and the rest is set up for you. This requires initial setup and some cognitive effort, but once you've set up your headers the way you like them, you never have to do it again. When your advisor/boss tells you to change all your italicized headers to underlines, you only have to change it in the style sheet, instead of hunting through a hundred pages looking for all the italicized headers. [In Open Office, this is `the stylist'; no idea what it's called in Word. RTFM.]

Never, ever, write a bibliography by hand. Intuitive, direct method: read the Chicago Manual of Style and learn the rules for italics, punctuation, and ordering. Efficient and effective method: use a bibliography database. LaTeX has bibtex; Open Office has a built-in bibliography manager; for about a hundred dollars, you can purchase EndNote for Word. [The lack of a bib DB is reason enough to dump Word. Thinking about the pain my mother suffered writing her bibliographies in Word makes me well up.] You have to set up your bibliographic info in a database, then insert formatting codes into your document, and then you're guaranteed that every last comma and period will be in the right place, and that when you refer in text to Schweitzer[15], that Schweitzer is not number sixteen or fourteen in the bib.
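For the bibtex route, the shape of the thing is roughly as follows; the entry, the citation key, the publication details, and the file names are invented for illustration, so substitute your own:

```latex
% refs.bib --- one entry per source; bibtex handles the ordering and punctuation.
@book{schweitzer90,
  author    = {Albert Schweitzer},
  title     = {Out of My Life and Thought},
  publisher = {Some Press},
  year      = {1990}
}

% in your document:
As Schweitzer~\cite{schweitzer90} argues, \ldots
\bibliographystyle{plain}
\bibliography{refs}
```

Run latex, then bibtex, then latex twice more, and every citation number and every last comma in the bibliography comes out consistent, no Chicago Manual required.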

Stop using your spreadsheet as a database. Spreadsheets make it easy to instantly create a list of things. The same guy I linked above (Joel on software) worked on the MSFT Excel team. There they were at MSFT, designing amortization wizards (because amortization should be intuitively obvious), and one day they watched some users actually use their stuff, and found that most users just used their software for writing lists; so they added lots of list-making and list-handling features. However, there's actually an entire class of programs built from the relational algebra up for making and organizing lists: databases. As with everything else I'm talking about, databases are conceptually unintuitive, require setting up before you can start writing lists, and will save you hours and hours of your life in the long run. You may have a copy of MSFT Access on your hard drive right now.

[Data geeks: instead of using SAS or whatever, check out SQLite; between that and the GNU Scientific Library, I have the data analysis package I'd always dreamed of. Since 10 November, I really did switch all my models and statistical analyses to C, and the effort has already paid dividends.]
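To make the database point concrete, here's a sketch of list-keeping with SQLite from the command line; the database path, table name, and columns are all made up for the example:

```shell
# made-up example: a reading list as a real database table instead of a spreadsheet.
DB=/tmp/papers_demo.db
rm -f "$DB"               # start fresh for the demo
sqlite3 "$DB" <<'EOF'
CREATE TABLE papers (author TEXT, year INTEGER, title TEXT);
INSERT INTO papers VALUES ('Schweitzer', 1990, 'Out of My Life and Thought');
INSERT INTO papers VALUES ('Donne', 1624, 'Devotions');
SELECT author FROM papers WHERE year < 1900;   -- prints: Donne
EOF
```

Once the list is a table, sorting, filtering, and cross-referencing are one-line queries instead of an afternoon of dragging cells around.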

As a matter of fact, dump Word entirely. Even ditch Open Office, which is nicer but still the same ballpark. LaTeX is the best document preparation system in existence, by far. Many authors typeset their books in it. Same as above: initial learning curve, new paradigm, won't get anything done the first few hours, but will save you weeks of pain in the long run.

[There are graphic shells for LaTeX, like Scientific Word. They unabashedly suck. I have pals who write their complex equations in SciWord and then cut and paste the resultant LaTeX codes into their main text-edited document, which seems like the optimal use of these things. People who LaTeX from Windows are fans of WinEdt, which has some nice LaTeX-specific features.]

No, better still, dump Windows entirely. Do I need to say it by now? Windows is aimed at being intuitive from the get-go, while the Unix-type systems are aimed at automating and simplifying your work and then getting the f*ck out of the way. If you have a life sentence with one paradigm or the other, forced to spend a thousand hours a year chained to a keyboard, which would you choose?

Windows people: try Cygwin, which installs a Linux box within Windows, so you can use LaTeX there, and produce beautiful PDFs that you can pretend you wrote in Word. [That's how I'm writing this now, so if a Windows fascist comes by, I can minimize Cygwin and pretend I'm using Windows. In my experience on this machine, Windows & Cygwin coexist nicely.] MacPeople: if you're running OS X, you've already got a Unix box; get to know it and the non-Mac software which will help you do the above.

The commitment here, by the way, is not in installing the OS, since Unix will coexist with Windows or OS X, but in learning the myriad or two of programs which Unix facilitates. If you're still avoiding work, have a look at my essay on Unix and its users, linked at right.


06 February 04. A lament about bad design

OK, this was going to be a lament about any of a number of things, with a general discussion of how to design things to be useful but not annoying. However, I fear that it's going to turn into a rant about MSFT. Sorry, guys.

It's about designing things for dumb people. This is, by itself, not a bad thing. After all, designing for less cognitive effort benefits us all, just as features designed for the handicapped are often embraced by able people who just want it easier. [I've been using the Twiddler lately; it would be most useful for people who only have one functioning hand, but I like it because it lets me drink tea and type at the same time.] Or, at the other extreme, here is an article about how an unintuitive interface killed John Denver.

The two prime examples of this would be advertising and MSFT products. I think the whole thing about idiot-proofing Windows has been discussed to death, and needs no elaboration here. I've already talked about how advertising has gone from a long textual evocation of the product (with bold headlines for those who are just skimming) to a picture of the product being held against a pair of breasts.

Don't get me wrong, intuitive interfaces and things that dumb or inattentive people can readily digest are not necessarily bad. But what makes them horrible is when the design makes it impossible or too difficult for not-dumb people to go beyond the dumb level.

For example, when waiting for a subway train, I am confronted with a number of large, backlit ads right in front of me, and I typically have about ten minutes to kill. This is the perfect opportunity for the vendor to tell me all about the product, in great, backlit detail. And yet all settle for a picture. and a tag line. with inappropriately placed periods. where a comma or hyphen will do.

This is true for both SBUX, which will just show you a picture of a cup (which we presume contains coffee) or Boeing/McDonnell-Douglas, which outbid SBUX for ad placement at the Pentagon Metro station, and advertises bombers and helicopters. Surely there's more information that we need to know about the latest bomber than about a cup of coffee?

But the advertising won't tell me. If I care, there's nothing for me to do for ten minutes but to stare at the picture some more. Oh, I could look elsewhere, but I'm not elsewhere. I'm on a subway platform, waiting. I want to see the information that got thrown away in a desperate attempt to get the point across with a minimum of cognitive effort, and am frustrated that I can't.

The other prime example of this is of course anything written for MSFT Windows. It's easy to sit down and use, which I would be an arse to be annoyed by, but it's supremely difficult to go beyond the easy stuff. I spent an hour yesterday trying to get the cute little browser thing to hide files beginning with a dot. I even wrote Dell tech support, who blew me off. As you can plainly see, if I want something that seems possible, but isn't, I will be frustrated and unable to continue to function as a normal human being. My favorite foil whom I've linked to before, Joel, goes on and on about how frustration comes from things that don't work the way you expect them to, adding little bits of cognitive effort and annoyance to your day. Joel probably describes many people, but I am most frustrated by tools that just plain don't work. Screwdrivers are truly counterintuitive, if you ask me (to make the screw go out, turn counterclockwise?), but I learned the righty-tighty/lefty-loosey thing. When even that doesn't help (say the screw is upside-down, or it's one of the few reversed-thread nuts on a bike), I am indeed frustrated. But I am infinitely more frustrated when the screwdriver is made from cheap metal and bends when the screw is too hard to undo.

Implicit to all of these things is a promise: I will tell you about my product; I will help you make your document look just right; I will unscrew your screws. Sometimes getting that promise to work takes some compromise from both sides, which is how life is. It's not the compromises but the broken promises that really hurt.

Somewhere, I read about how temperature gauges are less common on cars now, since somebody worked out that most people interact with the gauge by just looking to see if it's in the red, and panicking if it is. So why bother with a gauge? Instead, you just get a little light that tells you when the temperature is in what would have been the gauge's red part. I told Mr. DRC of Santa Monica, CA---a car expert if ever there was one---about this, and he had a hissy fit, listing three dozen things you can learn from a temperature gauge beyond whether it's in the red. So in following Joel's advice about minimizing cognitive effort for 95% of drivers, the other 5% are frustrated and dejected.

It doesn't have to be that way. Design that includes the lazy doesn't have to exclude those who care, and if it does, it's as bad a design as one that only makes sense if you study it for an hour.

I tried to come up with more examples of where things have been redesigned for the lowest common denominator and thus shut out those who care, but couldn't think of anything really good and pervasive. Television has always been written for dumb people, and since there's a time constraint, you have to pick your level of information and stick with it---unlike a print ad, it's physically impossible to say more. There are thousands of books with `for Dummies' in the title, but there have always been such how-to books, and for every such book, there's another that goes into all the detail you could want. This is even true of management books, which are typically the most supremely oversimplified books in existence, since businessmen often have a pompously overinflated idea of what their time is worth. Perhaps you, dear reader, can leave some suggestions in the box below.

Meanwhile, I have nothing but a lament about the two realms where withholding from the consumer is vehemently defended as a good thing: working with PCs, and advertising. One particular item stands out as the intersection of the two: MSFT PowerPoint, a computer program for creating advertising presentations. Its design makes summarization and minimization of cognitive effort easy and information dissemination difficult. E.g., as a counterpart to the too-difficult interface which caused a disaster above, PowerPoint's design is partly responsible for the destruction of a Space Shuttle.

on Friday, September 29th, Lure Knightstalker said

Items in an average day which have been LCD'ed (Lowest Common Denominator'ed); yes, some are subsets of others.

Windows, all versions
Any chat program which changes emoticons to real icons for those who can't read them
Cell phones (good thing, but overdone), particularly cell phones with multifunction buttons (top of button is up, center is select; when on a call it's end call, when not on a call...)
Microwaves (modern, not old). Microwaves used to have one or two dials (time, power) and the button to open the door. Now they have overly complex sets of keypads. This in and of itself is not a bad thing, but when it makes me take more time than the minimum (place food, close door, twist dial), I see that as a bad thing.
Cars (good thing: no more choke or messing with the accelerator to turn on)
The list goes on, but these are the most egregious examples I can think of right now.

on Saturday, January 19th, Lure Knightstalker said

Another one: the front panel of a VCR. Used to be you could do anything from the front panel, including play, record, setup functions, changing the channel, etc. More and more there are stunted front panels which allow play, ff, rew, eject, and power, but have no way to enter the menu, change channels, or reach any other functions. TVs don't have this problem: they always had volume, channel, and power, and they still do; for them it's been feature growth, and oftentimes the menu and front panel will still let you access all functions of the TV (thanks to a menu button).


14 July 04. RSS, again.

So back in April, I'd written about the joy and delight of the RSS feed. The summary: whereas the web had once been an endless pit of time-consumingness, RSS made it manageable and easy. Whereas I'd spend all day in glassy-eyed clicking before, I could now spend a little under an hour reading everything I could possibly read, and could then move on to do things involving the real world.

Oh, how times have changed. I now have so many RSS feeds that it is a truly Sisyphean task to read them all. Whereas before I could have a set, fixed endpoint (`stop when I've read all the feed updates'), this is now impossible. Meerkat, O'Reilly's wire service, will feed me a thousand links a day. Seriously reading two percent of that is already an hour. And I haven't even gotten to the newspaper yet: the New York Times will feed me a hundred articles a day, of which I'll want to read a whole lot more than two percent.

In other news, I've entirely stopped reading anything that doesn't have an RSS feed. I paid some guy to write up an RSS feed for Toothpaste for Dinner because it's funny in a severely embittered kind of way, but with no RSS feed pushing it upon me, I never looked at it. So I'm lazier, but not actually saving more time.

The other thing that amazes me about all of these RSS feeds is the immense repetition. First, there's direct copying: Meerkat aggregates other RSS feeds, without editing, and puts them out for you in one feed instead of several. [And so, given that feed readers are now a dime a download, I'm not sure what its point is anymore.] And of course, other blogs frequently have entries among the original content of the form `so over at this blog, they say...' without really adding much of anything.

But beyond that is the original generation of the same idea that ten other people also originally came up with. As some of you may know, I'm writing a book on software patents, so many of my feeds are about intellectual property, and frankly, the news is almost exactly the same on all ten of `em. Even the stories themselves tend to repeat; I've simply stopped reading DMCA cases, they're all so alike (as is the outrage they inspire). This form of repetition feels even sillier than the blatant copying above---at least there was no effort expended in copying. Here, there are extensive write-ups which boil down to the same facts and the same emotions; if these guys all teamed up, they'd have one feed with the same content and a tenth of the effort.

In my own head, this turns into a constant pressure to not repeat myself or others. The `others' part is frankly kind of easy: since all ten of our IP authors look at IP in the same way, I really only have to distinguish myself from one mindset. The `self' part is getting harder and harder. This is my 75th blog entry, and I simply don't have 75 actual real live ideas. This blog has been up for almost a year, but others have been up for the better part of a decade---how do they do it? It constantly worries me, and is the reason I've been posting less lately: when am I going to hit that age when everything I say is just repetition, and have I already hit it?


on Thursday, July 15th, Paul said

And yet somehow, there's another new idea! How do you do it?

Some writers have supposed that there's a finite number of stories to tell, and everything is just a variation on one of those. (36 is one number:

And yeah, I struggle with just linking to stuff. I hate to do it--I feel like I'm cheating if I'm not providing something original--but I also know friends who don't surf compulsively, constantly, and who actually find stuff through me. Like Angelique will be really thankful for a "curated" presentation of political links.

But yes, it feels kind of pointless sometimes. Like anybody should be able to pull up and know 95% of what's in bloggoworld w/o ever even reading a blog.

If anything, I always try to--if I'm going to link--link to more obscure or local or small-scale stuff. Like "How to Argue with a Conservative," e.g. (By the way, I want a conclusion!)


24 October 04. How to live with multiple computers

[A dump of a large number of small pieces of advice from years of shunting data. File under `boring but useful'.]

Problem statement: You work on multiple computers: home and away, or a desktop PC and a laptop, and you want to use both to work on the same projects, so you need some way to reliably shunt data back and forth. Me, I have three home laptops in various states of brokenness, a PC at (name of think tank) and a PC at (name of university), and a few accounts in the ether somewhere which I log in to from time to time, and all that has to keep reasonably synced up.

Buried into all of this are a few regimes for backing up data. Not to be mean or anything, but to those of you who aren't backing up your data regularly: what are you, fuck*ng stupid? If your work is worth doing once, it's worth making sure you don't have to do it again. [Personal to Ms. JATMM of Mount Vernon, VA: my continuing condolences. I don't mean you.]

Step one: have a home dir: Unix geeks, you already have a home directory. But you Windows users probably have data spewed all over your hard drive. Many programs like to make their own little directories for storing data, so your MP3s are in c:\Program Files\Brand Name\Program Name\Data, while your great unfinished novel is in c:\Another Brand Name\Word Processor\user Xp458yjz\documents\. Get them all in one place, which I suggest you name c:\home. [If you have multiple partitions, use the one that the operating system isn't on, which will probably be d:\home. If you don't know what a disk partition is, don't worry about it.]

MacPeople, you also have a home directory, but OS X tries to hide it from you. You have a Unix command line (the Terminal, in the Utilities folder); get to know it. Since I don't have an iAnything at hand, maybe somebody else can offer more suggestions in the comment boxes below.

Put all your data in the home directory. Make yourself c:\home\novel, c:\home\music, et cetera, and stick with the plan, even when your programs try to distract you. Linking (aka shortcuts) can help with this: e.g., set a shortcut from the Desktop directory (wherever it is; Microsoft keeps moving it) to c:\home\desktop.

The idea here is that if your computer falls to pieces, all you have to do is reinstall the software and then copy back your home directory, and it will be as good as new. You probably have reinstall disks for the software somewhere; you're basically making yourself a reinstall disk for your work.

Further, the capacity of hard drives will hopefully keep pace with the amount of crap that you increasingly collect over the course of your life---and so you'll never have to throw anything out, ever. Instead of putting your projects on floppies in the closet, you can keep everything you've ever done in your home directory, perhaps archiving from time to time into an archive subdirectory. Me, my /home/b/arch directory includes everything I've done on a computer since 1999. This comes in handy more often than you'd expect.

Also, dear Windows people, you may want to get a copy of Cygwin, which will let you execute the Unix commands below. You want this to be as easy as possible, and writing a batch file to run the command will, in the long run, be a lot easier than clicking lots of little boxes in WinZip every time, despite the initial setup. Also, children and animals will like you more.
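That batch file might amount to no more than a one-line function around tar; the directory names below are examples, and the exclude pattern is whatever bulky stuff you keep elsewhere:

```shell
# a sketch: archive a home directory, skipping one bulky subdirectory.
# usage: backup_home <home dir> <dir name to exclude> <output .tgz>
backup_home() {
  tar czf "$3" --exclude "$2" "$1"
}
```

Then `backup_home /home/b mp3s my_home.tgz' does the whole dance in one go, with no little boxes to click.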

option one: Directory replication We have two options here: the first is to find a place online, and the second is to carry around a physical object.

suboption one: Online: this web site is hosted by spiderhosts; I pay fifteen dollars a year and get thirty MB of storage. [Tell `em I sent you and I'll get a discount on my renewal.] That's enough for a lot of the important stuff. There's also Gmail, which will give you a frigging GB, and if you need more, I guess you can just get two accounts. Depending on the status of your connections, the internets may or may not always be available; this obviously won't work if you're not ubiquitously networked. [Personal to Mr. PH of Seattle, WA: now that Presidential decree has pluralized internets, I think it should be lower case.]

subsuboption one: with compression One line:
tar cvz --exclude /home/b/mp3s /home/b > my_home.tgz
Then email my_home.tgz to your gmail account, or what have you. When you get to the other computer, you can decompress by going just above your home directory (like c:\) and typing tar xvzf my_home.tgz
WinZip or StuffIt or what-have-you will work OK, but there's no easy way to tell it to exclude c:\home\mp3s. Maybe you can make it work.

subsuboption two: without compression Use rsync, which transfers only the changed parts of files that have changed, making for a very quick transfer:
rsync -avPe ssh --delete --exclude /home/b/MP3s /home/b/ user@other.box:/home/b/
The destination here is whatever name or IP address identifies your other machine. This is especially nice if your home or work PC is always on, in which case you can just transfer directly. Check your router to see your home IP address (which is not reliable, because most DSL/cable companies will change it on you to keep you from doing this sort of thing; if your ISP uses PPPoE instead of DHCP, complain), then turn on sshd on your PC (easy if you're unixy; I'm told Cygwin will also let you just type sshd at the command prompt, but I never tried it myself). All you need is for one of your two PCs to be identifiable by an IP address and have sshd running, and you're good to go.

The --delete option gives me a rush of fear and adrenaline every time I use it. But even without that, this is dangerous because if you reverse the from and to directories, then you'll overwrite your new stuff with old stuff, thus losing all the work you just did. Nothing sucks more. With compressed files, you can of course have the same problem.

More advice: think of one and only one drive (your home hard drive, work hard drive, your portable drive) as your primary home directory. On the others, set up an xfer_me directory of things that need filing back to the primary drive. The assumption is that if any of the non-primary computers go up in flames, you won't care. Day to day, you just need to transfer the xfer_me directory back to the primary from time to time, and then overwrite the non-primary home directories with fresh copies of the primary now and then.

Another useful tidbit: if your computer is a laptop, and you have a home network, then both of your computers are sharing a subnetwork. Your desktop probably has an address like 192.168.0.2 and your laptop something like 192.168.0.3. If one of these can run sshd, then you can directly rsync directories, and it'll be rather zippy.

suboption two: physical device: This would be something you carry around with you. Pretty much anything will work: those little USB keychains, your MP3 player, even your digital camera can probably store files for you. Portable storage is frigging everywhere. The obvious problem is that if you forget it at home, you're screwed, but it will of course work if you have no Net connection. Also, don't forget the cable, which I've done embarrassingly often. [Spare tip for buying a spare cable: A bit of poking around online should find you the formal name of your cable---more and more are A-mini B---and you'll find that you pay a lot less for an `A-whatever cable' than a `cable for brand name device'.]

Devices that your operating system recognizes as just another drive are the best; some devices (Kodak cameras, Apple iPods, Creative MP3 players) require additional software, which will require installation everywhere you want to use it---annoying. On such devices you may have to archive your home directory into one file (see subsuboption one, above) and then copy that single file over.

The world of devices falls into two categories: flash memory based and hard drive based. With a hard drive based device, you may as well make _that_ your home directory, since you should have enough space to do so. The only problem is that there's a slowdown with external hard drives; unless you've got really good equipment, it's noticeable.

[Me, I have an Archos Jukebox, which mounts normally and has the imprescindible feature of rubber bumpers. [That's Spanish for `can't-pass-up-able'.] I partitioned it into a vfat partition for the music (`cause the firmware demands it), and a reiserfs partition so I can have the pleasant features of real-live links, permissions, and a journal.]

Flash-based devices tend to give you much less storage---you'll have to leave the MP3s at home, and just back up the important stuff. That's OK with the above methods, since both give you options to exclude directories (which is why you want to do this from a batch file, where you can specify what to exclude once and for all). Not a problem, but don't forget to back up the data you're not carrying with you some other way.

Since you're not networked, you don't need the `e ssh' clause:
rsync -avP /home/b /mnt/cygdrive/e/home_bkup/

The biggest practical problem with both of these methods is that you have to remember to transfer every single time. If you work on the project at home, rush to work without transferring, and then work at the office, then reconciling the two versions will be a nightmare. You'll almost certainly lose data, which may make you cry.

option two: versioning systems The idea of a versioning system is that you have a repository somewhere which holds the project. When you want to work on it, you check it out, and then check it back in when you're done. Once checked in, your copy is entirely disposable.

Using a versioning system changes your mindset. You can screw with your copy of the project all you want, since it's just a copy; you don't have to save lots of revisions, since the system is doing that for you; you'll find your work will be more structured around work sessions which have a specific goal. It's fun.

Finally, it solves the problem of forgetting to bring to work changes you made at home, since it will do its best to merge two modified versions without losing anything. This sometimes requires human assistance (it'll tell you when), but you never lose work.

The standard revision control system is CVS, which (if you're using Cygwin or anything unixy) is on your hard drive now. CVS is being superseded by Subversion, but Subversion isn't yet common, in the sense of any given computer basically being guaranteed to have it. If you're reading this more than a year from now, try Subversion first.

Once you set it up (RTFM), the only commands you'll ever need are
cvs get my_project
cvs ci (aka cvs commit)
cvs update
Oh, and when you add a new file to your project, don't forget
cvs add new_file ;
forgetting that is basically the only way to lose data with CVS. If you ever need an old copy, or to do weird things to your data, RTFM, but it's very easy. There are also graphical shells like tk_cvs, which I have on one or two machines but never really bother with, since 97% of the work is the above four commands.
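If you want to see the whole life cycle in one place, here's a sketch of a CVS session against a throwaway local repository; the project and file names are invented, and the one-time setup on your machine (RTFM) will differ:

```shell
#!/usr/bin/env bash
# A day in the life of CVS, against a throwaway local repository.
# Project and file names are invented for the demo.
command -v cvs >/dev/null 2>&1 || { echo "no cvs here; skipping"; exit 0; }
set -e

export CVSROOT=$(mktemp -d)/repo
cvs -d "$CVSROOT" init            # one-time repository setup

work=$(mktemp -d); cd "$work"
mkdir my_project
echo 'hello' > my_project/notes.txt
cd my_project
cvs import -m "initial drop" my_project vendor start   # put it under CVS
cd ..
rm -rf my_project                 # the imported copy is now disposable

cvs get my_project                # check out a working copy
cd my_project
echo 'more' >> notes.txt
cvs ci -m "today's work"          # check changes back in
echo 'new stuff' > new_file
cvs add new_file                  # the easy-to-forget step
cvs ci -m "add new_file"
cvs update                        # pull in anything committed elsewhere
```

Once the repository exists, the middle of that script--get, edit, ci, add, update--really is the whole routine.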

Oh, and don't forget to back up your CVS directory from time to time. Mine crashed once. I managed to not cry but I was grinding my teeth in my sleep a lot after that. You'll also have to decide where to put the repository. If you're a student at a university or a not-Microsoft-dominated office, then you probably have a shell account where you can put it. It's surprisingly small, so a spiderhosts account will do just fine for the rest of you. Unless you're doing something screwy, your CVS repository will be under a few dozen MB, so a keychain drive will also work.

[Tip on keychain USB drives: avoid puffy ones. I saw a girl almost weep when the only copy of her presentation was on a USB drive which didn't work because the plastic casing got in the way on the laptop she was trying to plug in to. She was also silly for not having backup: when traveling to do the Big Presentation, leave a copy online, on your keychain, and on a burned CD; chances are that one of them will work, but you won't know which until you get there.]

CVS will work great for that part of your life which is project based, but you'll have to go back to option one to handle your MP3s and family photos. I think you're getting the picture here: the project you're working on these days should probably be in CVS, and then you can carry around a hard drive with all of the not-so-frequently-changing stuff that makes your home directory a home.


on Thursday, October 28th, Paul said

Mac people fortunately do have an easy-to-get-to home directory with OS X, which you'll find under Users at the top of your hard drive. It'll look like--this being a Mac--a cute little house with your user name (like mine's phughes).

Backup can be really, really easy and fast. Just use an external drive (~$100) and this great app called Synk: Synk makes it easy to only back up things that have changed in your home directory since the last time you backed up, and the super-simple documentation walks you through how to do this.

That means the first time you backup will take a while, but after that, Synk is only saving new files and archiving things you've trashed. For me, the new files are mostly Word files, .pdfs and .ppts etc. from clients, and new music, so over the Firewire hard-drive, backup usu. takes me about 2 minutes. I backup once a week because it's so painless.

It's been said a million times: There are two kinds of people--those who have already lost data and those who will. It's true, although you might not believe it until you're on the verge of weeping after losing your complete e-mail archive from 1992 to date.

And yes, clearly "internets" shouldn't take an initial cap. Phew! Glad to have that settled.


16 March 05. Complementing your stats package with SQL

The basic principle behind Apophenia is that data should be kept in a database until it's needed, and then just enough pulled out for analysis. There are some things that are much easier to do in SQL, and there are some things that are much easier in a matrix-oriented programming language, and knowing what to do in which context can save hours.

Figure One: Just meant for each other.

Things which are easier in a database

¤ The mega-dataset, which asked every respondent eight thousand questions, of which you're going to use six. You need to read the whole file to get the data you need, but your computer doesn't have enough memory to hold all 800MB of data. So INSERT every line into the database, which will get written to disk, then select out the little bit you need. Your little laptop won't even break a sweat.

By the way, don't forget: if you run regressions on all 8,000 variables, you're a bad person.

¤ Anything involving more than two dimensions. This is really what databases are designed for. You've got one data set which relates cholesterol levels to smoking, and another which relates smoking rates to frequency of getting laid, and you want to show the correlation between high cholesterol and multiple sexual partners. Such merging of data sets is a basic operation in database land, and a total pain in matrix terms. [Select t1.*, t2.* from cholesterol_data t1, sex_data t2 where t1.smoking_rate == t2.smoking_rate. And yes, I know this is wholly spurious statistics; it's an attempt at humor.]

¤ Aggregation. Say you have a few observations of income (between one and ten, maybe) for every ZIP code, and you want the average per ZIP code. Again, a total pain via for loops through a matrix, but one line in SQL: select zip, avg(income) from data_table group by zip. SQL is limited by not having many aggregation functions beyond avg(), sum(), and count(), but that's all you'll need 90% of the time anyway. [You can still do weighted sums by things like sum(income * weight).]

¤ Subsetting. Some languages (matlab, octave, R) do a good job of this, but others (Stata, I'm told) have no really easy way to pull subsets of the data a la select * from table where X*Y>.5. As you can see from the example, SQL is built to make this stuff trivial.
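To make the aggregation and subsetting items concrete, here's a hedged sketch using the sqlite3 command-line tool as a stand-in for whatever database you run; the table, columns, and numbers are invented for the demo:

```shell
#!/usr/bin/env bash
# The aggregation and subsetting examples, run through sqlite3.
# Table and column values are invented for the demo.
command -v sqlite3 >/dev/null 2>&1 || { echo "no sqlite3 here; skipping"; exit 0; }
set -e

db=$(mktemp -d)/demo.db
sqlite3 "$db" <<'SQL'
create table data_table (zip text, income real);
insert into data_table values ('20001', 30000), ('20001', 50000),
                              ('60614', 40000);
-- Aggregation: average income per ZIP, one line, no for loops.
select zip, avg(income) from data_table group by zip;
-- Subsetting: pull only the rows you want.
select * from data_table where income > 35000;
SQL
```

Note that SQLite (and standard SQL generally) spells the averaging function avg().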

Things which are easier in a matrix-oriented program

¤ The actual math, the regressions and MLEs and such, are not gonna happen in the database, so after you've done all the data-shunting in the database, you'll have to pull it to a matrix for the final analysis. As above, the most math you can do in a database is basic arithmetic, and I haven't yet seen a db program which can be extended with user-defined functions. E.g., no reasonable query will attach a Gaussian-distributed random draw to each observation.

¤ Anything in which the data must be ordered, such as producing a CDF from a PDF. [But time series guys, you can lag in SQL: select (t1.income - t2.income) as diff from data_set t1, data_set t2 where t2.year == (t1.year - 1)]

¤ Real live matrices (as opposed to data sets). SQL is an algebra on tables, but its concept of the product is pretty drastically different from the matrix algebra product. Taking the transpose of a database table just makes no sense.
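The lag aside above deserves a worked example. A sketch via sqlite3, with an invented `year' column standing in for whatever your time index happens to be:

```shell
#!/usr/bin/env bash
# Lag via a self-join, spelled out with sqlite3. The year column is an
# invented stand-in for your time index.
command -v sqlite3 >/dev/null 2>&1 || { echo "no sqlite3 here; skipping"; exit 0; }
set -e

db=$(mktemp -d)/lag.db
sqlite3 "$db" <<'SQL'
create table data_set (year integer, income real);
insert into data_set values (2001, 100), (2002, 110), (2003, 125);
-- Join the table to itself at a one-year offset to get first differences:
select t1.year, (t1.income - t2.income) as diff
  from data_set t1, data_set t2
  where t2.year == (t1.year - 1);
SQL
```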

Summary: database methods are no panacea, but they are an excellent complement to matrix-oriented programming languages, because things which are difficult in one are often easy in the other. If you know what is easier in which, you can save a whole lot of your life by not having to write little procedures.

Policy implications: read up on SQL (many a tutorial out there). If your favorite stats package doesn't handle database operations, you've got two choices: dump it and get to know something like Apophenia, or get a standalone database program and use both. Write the data-massaging half of your scripts in the database program, then write the analysis in the matrix-handling program. If the two programs are worth anything, they should have command-line and text file reading/writing utilities to facilitate this.
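Here's what that two-program split can look like in practice, sketched with sqlite3 on the database side; the table, columns, and file names are all invented, and the analysis half is left to whatever matrix-handling package you favor:

```shell
#!/usr/bin/env bash
# Sketch of the two-program workflow: massage the data in the database,
# then hand a plain text file to the matrix-oriented package. All names
# here are invented.
command -v sqlite3 >/dev/null 2>&1 || { echo "no sqlite3 here; skipping"; exit 0; }
set -e

db=$(mktemp -d)/survey.db
sqlite3 "$db" "create table survey (id integer, age integer, income real);
               insert into survey values (1, 30, 50000), (2, 17, 0), (3, 45, 80000);"

# Data-massaging half: subset and select in the database, dump to text...
out=$(mktemp)
sqlite3 -csv "$db" "select age, income from survey where age >= 18;" > "$out"

# ...then any package that reads text files can take it from here.
cat "$out"
```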


on Saturday, April 12th, Antonio Ramos said


Where can I find your sql tutorial? Many thanks, Antonio.

on Sunday, April 13th, The author said

Maybe have a look at chapter four of _Modeling with Data_, which you can download from its web page. The first 80% of chapter four stands alone, in the sense that you should be able to follow it without reading chapters one through three.


22 July 05. Take my program---please!

You're sitting there thinking, `gee, I could use a program to do a certain nifty trick.' You go online, ask your favorite search engine for nifty-trick programs, and get a list of two hundred. You click on a few, watch them fill your hard drive with crap that you can't identify, and click a variety of setup programs which spew garbage all over a still wider range of hard drive. You find one you like, but then it starts popping up `buy my shareware' reminders. Eventually, the whole system crashes, and then you have to dig up lots of CDs to reinstall your office package, your stats package, your graphics package, &c. Six months later, all the cool kids have switched to new software, and you have to start all over to upgrade.

I think we can all agree that finding and installing software sucks, which is where the package managers come in.

Here's an oft-overlooked detail about free and open source software: it's free, in the sense that it doesn't cost anything. Not only are there no access restrictions keeping you from getting a program, but people veritably _push_ the stuff on you. I only partially understand the motivations, but the result has been a host of ways to get free software, all of which are extremely easy, because somebody went through great trouble to facilitate your free-riding.

Thanks, guys.

The idea behind all of the below is the package: a single unit of program with a self-executing installation program. There's a repository somewhere, run by the e-philanthropists of the world (usually universities like UNC) which has the packages; you have on your computer a package manager which queries the server for what's available and gives you a list. You search for nifty-trick program, click on it, it installs. Six months later, the package manager works out that nifty-trick has a new version out; you click update and suddenly you have more features.

No need to involve Google and sift and sift some more; no need to go through yet another installation program which requires endless clicking of the OK button only to find out that it's not quite what you wanted. Getting a new program takes about a minute of your attention and zero dollars, instead of hours of your life and who knows how much of your cash.

The systems
At this point, those readers who don't run a system with a package manager are hopefully thinking, `gee, I'm missing out.' So, here are some package systems which you may want to consider.

Windows people
Cygwin, which gives you a UNIX subsystem, including the X Window system, and loads of packages. The setup is package-based; just re-run the thing to pull more packages. I'm pretty sure the packages are basically RPMs (see below); you can also install some RPMs directly.

Mac people
Fink, which is a set of ports from Debian. Since all of OS X is a UNIX system, it is reasonably seamless to interoperate with your other OS X programs.

All linux distributions run a package manager of some sort, so it's just a question of which. The two competing styles are the Debian APT packages, originally set up by Deb and Ian at Purdue University (I wonder if they're still together), and the Red Hat Package Manager (RPM) system. Those in the know tell us that the APT system is better written; my own experience to date with APT has been better as well. I'm using Ubuntu's distribution right now. But RPMs have improved over the years, and may work fine for you.

Another option, which I'm running on my badass number-cruncher, is Gentoo's portage. It downloads source code, and uses the GNU's autoconf system to work out how to compile the program from source on your computer. Very cool and very automated. Unfortunately, the installation system for Gentoo itself is currently underautomated, so it's only useful for things like dedicated badass number-crunching machines.

[The GNU autoconf system, by the way, is an absolute revolution in how the world of free computing works. The likelihood that you download something and it compiles and works the first time is exponentially higher if it's built using the autoconf system, and that's what allows everybody to just set a few switches and distribute different versions of the same program for so many systems.]

Why they work
There are a few reasons why this would only work in a unixy system and with free software.

There's the idea of the library, which all computing systems have. [Windows calls it a DLL: dynamically linked library. Microsoft employees refer to the problem of getting multiple DLLs to play nice together as `DLL Hell'.] The library is a bunch of functions to do something like paint pixels on a screen or read MP3s or calculate Fourier transforms. A full-blown program just gathers together lots of library functions and runs them in sequence.
[Really, a program is itself just a function library, with a function named main which will auto-execute.]

The Windows OS and Mac OS before version X never got the library thing right, for reasons I won't bore you with now. But on a unixy system, multiple versions of a library, and hundreds of `em, can live together in harmony.

When somebody wants to write a nifty program, they don't have to start from scratch, because they have the libraries at hand---which is a key difference from not-free software, where the WordPerfect people and Microsoft Word people wrote a host of identical libraries. Further, the communal libraries get better with time, since the author of each new program likely contributes a function or two to the library itself.

The package concept and the library concept correspond. For those who don't know the system, that's why your package manager insists on first installing a billion things named libwhtvr-0.4 when you try to install a single program.

There are few ways to calculate a Fourier transform, but there will always be differences in æsthetic tastes, so there are often many clicky front-end programs that call these libraries. They'll all coexist nicely, because unixy systems (mostly) get the sharing of libraries right. In short, there's competition where it matters, and cooperation where it would be a waste of time. There are a million papers about how this mix of cooperation and competition is delightfully interesting as an economic phenomenon, but those papers are generally not that great, and you don't care---all that matters is that the system has led to a unified scheme to let you load up your laptop effortlessly.

[Notice that stats packages don't follow this library-and-front-end standard. There are a hundred out there, and most of them write their own computational routines, and all of them write their own stats routines. Which is why the world needs Apophenia.]

The complaints
There are a few problems with the package system. Instead of getting your search engine's list of a thousand competing titles of varying quality, you often get only one or two (which call the same libraries). The GNU postscript rendering library messes up landscape views, and will do the same if your front end is ghostview or the GNOME PDF viewer. The GIMP is the only full-scale image manipulation program in the free and packaged world, and if you don't like it, it's tough cookies for you. The converse is that if you do, a lot of people have all put their energies into it, and it gets better literally every day.

The package system is generally paternalistic, in that you're depending on the people who packaged the stuff to get around to it; e.g., I've been waiting for Gentoo's people to package GCC-4.0 for months now. Building stuff outside of the package manager is OK, but often causes problems, since dependencies get messed up, blah blah. [I couldn't do it with GCC-4]. A full upgrade when the base libraries like libstdc and the kernel change significantly is spotty---in my own experience, RPMs are the worst with this.

[And autoconf is a pain to work with on the developer side. Having attempted to autoconfiscate my own stuff, the documentation is hard to work through, and if your program worked with autoconf 1.7, you may have to rewrite all of your scripts to get them to work with autoconf 1.9. The price we pay for magic.]

So, there's the joy of the package system. Try it out.


08 January 06. Why Word is a terrible program

I was writing a blog entry about proselytizing, which I dislike and may not post, but one point that came up is that the only thing that I actively proselytize, the only thing that I really want other people to do differently, is to use a semantically-oriented document preparation system.

Yes, I realize this makes me a geek, and that maybe world peace should be higher on the list. But I've spent so much time watching people do horribly inefficient things for hours on end, and it pains me. It does.

So, I've written an article about it. Here it is: Why Word is a terrible program, and the PDF version. It's the aggregation of a number of past entries, and is in no way about the politics of software. And why was it so easy to splice together such a volume of text? Because I didn't use Word to do it.

Oh, how cathartic. Now I can get back to writing grant proposals with a clearer conscience that I've done something to help the world.


on Saturday, March 15th, Andre said

Use each tool for what it is meant for. Word is great for quick office memos done by people with very vague idea about bibliography and typesetting who would have more fun spending Friday night out rather than preparing long scientific articles in LaTeX.

Also IMHO if you have to change style manually through all the document instead of using automated find/replace method, then you'd better change the title of your research from 'Why Word is a terrible program' to 'What I would like to know about Word'.


06 May 06. The schism

[PDF version]

Those of you who actually read my posts about efficient computing, rather than just going to read the comics at the first sight of the word `computing', may by now have noticed a few patterns.

The most basic is that standards are important. I know this sounds obvious to you, but if it's so obvious, why do people get it wrong so darn often? Why are people constantly modifying and violating standards that work just fine?

I know many of you have suspected this for a while, but let me state it loud and clear: I am conservative. Rabidly conservative. I think that people need to have a really good reason for not conforming to technical standards, and I think most people don't--they just use the shiniest thing available. A large amount of my writing on technical matters is simply pointing out that well-thought-out technical standards tend to work better than the newest and shiniest, and that the value of stability often more than makes up for any flaws in the standards. Even my work on patents is aimed at making sure that open standards remain open and free to implement.

I originally tried to make this into an essay about both computing standards and general customs, but over the course of writing it, I came to realize that the two are fundamentally different. If somebody doesn't quite conform to your human customs--if they use the wrong fork or speak non-native English or wear ratty t-shirts to the office--then the person will be funny or diverse or annoying or just normal. Meanwhile, if computing standards aren't followed--if somebody gets sick of C's array notation, array[i][j], and decides it looks nicer as array[i, j]--then their writing is 100% gibberish and they might as well be speaking Hindi to an English-speaker. Standards-breaking in social settings can be fun; standards-breaking in computing is just breaking things.

So although I usually try to put something in the technical essays that will be interesting to those who could care less about machinery, I don't think any of the below is truly applicable to social norms. Or you can read on and decide for yourself.

Nor is this a comprehensive essay on standards drift and revolution, because that would take a volume or two. Just file this one as assorted notes on one question with an interesting proposed solution: what to do with all those people who keep trying to revise and update and modify the standards?

Intuitively, there's the English-teacher approach, where we force everybody to stay in line with the basic standard. When you go home to write your pals, your English teacher instructed you, be sure to use perfect grammar at all times.

But another approach is to let the whippersnappers fork. On the face of it, it may seem contradictory to think that splitting a standard in half would somehow make it purer, but under the right conditions, it can be the best approach.

For any technological realm, you've got one set of people who just want features--lots and lots of features, enough to wallow in like they're a bed of slightly moist hundred dollar bills--and you've got another team that wants fewer moving parts, and takes care to maintain discipline and stick to the existing norms. We can bind the two teams together, in which case they will constantly be fighting over little modifications to the system and neither team will be happy. That's what happens with English. Or you can have the schism.

Allow me to cut and paste from Amazon:

The C Programming Language by Brian W. Kernighan, Dennis M. Ritchie
274 pages
Publisher: Prentice Hall PTR; 2nd edition (March 22, 1988)
Sales Rank, paperback: #4,457
Sales Rank, hardcover: #445,546

First edition (228pp, 1978): Sales Rank, paperback: #60,113

The C++ Programming Language by Bjarne Stroustrup
911 pages
Publisher: Addison-Wesley Professional; 3rd edition (February 15, 2000)
Sales Rank, paperback: #11,797
Sales Rank, hardcover: #6,215

First edition: Sales Rank, paperback: #1,243,918

Things we conclude: C++ is much more complex than C--274pp v 911pp. C++ keeps evolving: from 1986 to 2000, the book has had three editions, over which it has tripled in size. People are still buying the 1978 edition of K&R C because it's still correct; the first edition of Stroustrup is so incompatible with current C++ that people can't give it away. Finally, Prentice-Hall really needs to lower the price on the hardcover edition of K&R. I mean, my book is selling better than their hardcover, which ain't right.

Meanwhile, C is as stable as can be. Cyndi Lauper has put out seven albums since K&R C came out. The changes from first to 2nd ed. of K&R are pretty small--literally, they're a fine print appendix. And, I contend here, it owes its immense stability to Bjarne Stroustrup. With Bjarne putting out a new version of C++ every few years that frolics along with still more features, Prentice-Hall is free to reprint the same version of the C book without people whinging about how it's missing discussion of mutable virtual object templates. The guys who want simplicity and stability buy K&R and the guys who want niftiness and fun features buy Stroustrup and everybody's happy.

The other technical standard I use heavily is TeX, and I'd been meaning, for the sake of full disclosure, to give a critique of TeX comparable to this here critique of Word. Fortunately, Mr. Nelson Beebe already did it for me, in this (PDF) essay entitled 25 Years of TeX and Metafont. The article alludes to exactly the sort of schism in typesetting as in general programming: you've got the people who are totally ignorant of standards and just want the shiniest new thing, and the people who built a standard system that has been stable for the better part of 25 years. Since he's on the standards-oriented team, he gives many examples of how such stability has led to large-scale projects that have significantly helped humanity.

His discussion of its limitations is interesting because there really are features that need to be added to TeX--notably, better support for non-European languages and easier extensibility. But "TeX is quite possibly the most stable and reliable software product of any substantial complexity that has ever been written by a human programmer." (p 15) Changing a code base that hasn't seen a bug in fifteen years is not to be taken lightly, so the process raises interesting questions.

So when you read about the raging debate between Blu-ray and HD DVD (I'm rooting for the one that isn't an acronym), don't think `oh, now I have to worry about all my stuff being obsolete'. Thank those guys for distracting attention from DVD, which is a nice, stable format that hasn't changed in a decade, ensuring that your stuff has not become obsolete. People have made haphazard attempts to revise the CD format, but thanks to distractions like the MiniDisc and even DVD, your copy of Cyndi Lauper's first album is still the cutting-edge CD standard (specified in The Red Book, 1980), while attempts to subvert the CD standard never took off. Remember CD+G? If so, you're the only one.

So how do conservatives evolve? Are we trapped in using standards from the 70s forever more? Of course not. But the evolution is not from clean standards to floundering in pits of features, but revolutionary breaks from old clean standards to new clean standards. The feature pits are just distractions.

The process of evolution via incremental fixes to follow the trends has an unimpressive track record. Corporate-sponsored standards often suffer this failing (but not always), because setting standards that last for two decades and selling frequent updates are hard to reconcile. One company spent a while there naming its document standards with a year--standard '98, standard 2000, et cetera--which in my book means none of the formats are actually standard. The right way is to ride a system until it really doesn't do what you need anymore, and then revolt, building a new one that is clearly distinguished from the old, as we saw with DVD's overthrow of CD because CDs truly can not store movies, or Ω's eventual overthrow of TeX because TeX truly can not typeset Tamil.

The trick is to know when to revolt. When is a new trick so valuable that the old system should be abandoned? Many a dissertation has been written on this one, and I ain't gonna answer it here. But for well-thought-out technical standards, it's much later than you think, as demonstrated by the active 25-year-old standards above.


on Saturday, June 17th, Quiznos said

I love the comment about Cyndi Lauper; a complete and absolute non sequitur if ever I saw one!!!


18 October 06. How to pick a computing language

[PDF version]

I can not stand how much debate there is about computing languages. I hate the fact that the Web is filled with it, I hate the fact that so many postings on Usenet in the way of `I need a Matlab routine to...' get replies like `Why aren't you using Algol!?', and I hate that my own work is so often evaluated based on choice of computing platform rather than actual output.

So, in my little effort for world peace, here are my notes on picking a language. I'll generalize this a bit next time, but the basic theme is that there is no One True Way. The process of picking a language is picking which is the least annoying trade-off, over the course of a series of many trade-offs.

The moral here is that, even though you will no doubt have a preference on one side or the other with all of the debates below, there is indeed a sensible other side, that other people prefer. That is, there are languages on both sides of any debate because there are valid reasons, both in terms of æsthetic preference and in terms of practical issues, for picking both options. Anybody who tells you otherwise, for example insisting that we must all use dynamically-typed languages from now on, is just being an ass. Pick the least annoying paradigm for yourself, and let your neighbors pick the least annoying paradigm for themselves. We'll all get our work done in the end.

Having ranted, here's a list of a dozen primary axes along which general-purpose computing languages differ. Work out which side you prefer, find a language that is on the same side as you, and go code.

1. Are the libraries what you need?
The primary joy in using an existing computing language is the strong hope that somebody has already written the code you need. However, I have never seen a language that really has a good library for everything I've wanted. I see the big schism as languages with lots of libraries for numerical and otherwise number-crunching routines versus languages with lots of support for Web handling, though you will no doubt find your own divisions among what type of code base supports what languages.

2. Does it assign types dynamically or statically?
The evangelists here all seem to be on the dynamic-typing side, but the issue is more muddled than a one-side-or-the-other split, and neither extreme is great.

If a system is enthusiastic about dynamic typing, then odd problems crop up. What if the first piece of text input happens to be "14"--will your system then assume that the input is all integers, and then crash when the next input is "fifteen"? Say you are trying to build a list of lists. Start with an empty list, then add the first list, then add the second list, and so on. The interpreter may cast an empty list to NULL, and may flatten a list of a single list to just a list--and if the first list happens to have no elements or one element, there may be still more casting until you're left with no list at all. With experience, you'll figure out what tricks you need to fix the problem, and will see it as no big deal.

With static typing, the annoyance is that you'll need to explicitly re-cast variables from time to time. For many but not all static languages, you will need to declare the type at first use, which is also potentially annoying--just as playing guess-the-type can be with a language that doesn't allow you to just come out and declare a variable's type.

Anyway, the question is not dynamic versus static, but how dynamic the language is. What is its list of auto-casts and what is its list of casts that you have to make yourself, and how well does that list fit your expectations?

3. What are the scoping rules?
Every language has different rules for when a variable is in or out of scope. One paradigm is the location of a variable in a file or a directory path. E.g.: in C, a file-global variable can be read by every function beneath it in the file; or in Matlab/Octave, a function is accessible if its .m file is in a specified path. The other paradigm is to define objects, and say that most variables have scope only within the object or its friends. In the first paradigm, files or portions of files can be read in (included) in other files so that scope is somehow shared; in the second, there is a syntax to say that one object inherits another.

The intent in both cases is to allow modularity and encapsulation, wherein one unit represents some self-contained concept and doesn't interfere with other self-contained concepts. Every useful language in existence has this concept, and every one of `em does it in a slightly different manner. I.e., the choice is the epitome of personal taste.

Evangelists here tend to be on the object-oriented side. But the scope-by-file method manages the same encapsulation and flexibility.

4. Is it too verbose or terse?
There is general agreement that too much verbosity in code is a bad thing.

But too little verbosity can also be bad. Coding textbooks often show off one-line programs that calculate the GDP per capita of Bangladesh based on the price of toothbrushes, with fourteen intermediate steps neatly cascading on a single line. This looks very impressive, and looks much more pleasing than the half-page version with all the intermediate mess on fourteen separate lines. It's certainly fun to write one-liners, but guess which routine is easier to debug. Hint: most of your time debugging is spent tracing a process through its intermediate steps.

Some languages are very good at hiding intermediate cruft--so good that you have no idea of what's going on. Some languages are terrible at hiding anything, and require that you always go through every step yourself. There's a practical and æsthetic balance to be drawn here; let's not pretend that terse code is always superior to more verbose code.

5. How does it handle aliases?
Let us say that you want to put a box at the front door so anybody can leave things or take things as necessary. For this to work, you have to tell everybody where to find the box. In computing land, you are not assigning to var1 and var2 the contents of the box, but its location.

So there needs to be two types of assignment: one that goes to the box and copies its contents into var1, and one that indicates that var1 is hereby an alias for the box itself. This is an inherently confusing concept, because the two types of assignment have so much in common. But gosh golly, you've gotta have it.

C and friends use variables and pointers to variables. Some languages use immediate and lazy evaluation for this. Some languages always assume that you mean aliasing unless forced otherwise. And some languages, which I consider to be too limited for anything more than teaching, eliminate confusion by disallowing aliasing entirely.

6. Call-by-what?
How are arguments passed in to subfunctions? Typically, languages do this by passing the value itself or by passing an alias for the value. Again, I'd consider a system to be braindead if it doesn't allow both, but, due to speed issues, the ability to pass aliases is the one that really shouldn't be passed up.

7. How does the language pass functions?
It would be nice to write a function that takes a function as an argument, such as apply(fn, list), which replaces every element of the list with fn(element). More generally, there are many reasons for passing functions as arguments to other functions. Some languages encourage this by treating functions like any other data, Lisp being the paradigmatic example. Others make it not-so-easy, like C, which lets you pass function pointers, but has no syntax for writing in-line minifunctions. There exist systems where functions can't be passed at all, which I again recommend throwing in the waste bin. But among the mainstream options, it's a question of how convenient you want function-passing to be--is this something you expect to use every other line, or something that is good to have handy when necessary?

8. Does it let you hang yourself?
This one is self-descriptive. The best example would be in the dynamic/static type issue. If you have a function that acts on text strings, and you give it the number 14, should it automatically cast it to "14" or give you a gentle `I'm sorry Dave, I can't do that'? C is famous for not stopping you when you access element fifteen of a fourteen-element list; some hate it for that and some rely on it heavily.

Every system glosses over errors in some directions and refuses to act in other directions, so the question here should really be: in what ways does the system let you hang yourself and when does it stop you, and are those the constraints that will help or frustrate you?

9. Is it fast or is it easy?
You can't have both, though every proselytizer advertises that their fave has achieved such a miracle. If you want something that is superpaternalistic and takes care of everything without your thinking about it, then you're asking the processor to do a lot of work. For example, if a system is enthusiastically dynamically typed, then the system will check the type of the variable at every single use; if you have a vector of a million data points, that's a lot of overhead. If your system thinks you're too dumb to understand aliases or call-by-reference, then it will copy the contents of the box where other systems just point to the box, again creating overhead. As I've noted before, all that ease and convenience can mean not just a ten or twenty percent speed drag, but a slowdown of about fifty times.

Joel the Guru has complained about people who just assume away all computational issues, and recently berated an evangelist who insisted on not caring. You're allowed to pick user friendliness over speed, but you have to acknowledge that you're making that decision.

10. Does it lean toward ease of use or ease of initial use?
Ease of initial use means that a new user can intuitively guess at what needs to be done with few errors. To achieve this, there are usually restrictions in place that prevent the user from doing the wrong thing, details are elided, and lots of in-place documentation is produced. Ease of long-term use means that the user has few restrictions, can tweak as many details as desired, and is not frequently interrupted. Most systems focus on either making the details easily accessible or on hiding them; few if any do a truly good job of guiding early users while at the same time getting out of the advanced user's way.

Evangelists miss this trade-off all the time, and pretend that because it is so easy to type print 'Hello, world' it must be that the system will always be easy to use. This rests on the condescending belief that users are unable to learn and adapt. Lisp coders have reason to brag about their great effectiveness even though their code is just a pile of parens to outsiders, and C coders who really understand pointers do things that are impossible in other languages, and all those guys have valid reasons for why Visual Basic isn't working for them. Let's not pretend that everybody's idea of ease of use is identical to that of a dilettante who will never go much further than printing `Hello, world'.

Anyway, in terms of picking the language, you can ask yourself if you will be using this darn system for the rest of your life, or are hoping to just do a one-off project that will not grow into a big mess, and pick a language accordingly.

11. Does its fluff seem useful to me?
Perl handles regular expressions as part of the language, which provides some nifty text-handling that is much more ornery in other languages. Some languages have built-in databases, or hashes, or lists. This may seem more fun and useful to you than calling these things in from a library. Every language has a library to handle regular expressions, interface with databases, and use hashes and lists; the question is only whether they are immediately on hand or up on the shelf, and what you want to have on hand. But if everything is immediately on hand in the form of a quirk in the grammar then you get, well, Perl, which has a very complex grammar and can often be hard to read.

12. Does it use too many parentheses?
Finally, there is the actual visual appeal: is the code filled with parens, tabs, stars? This is the last question on the list because, frankly, you'll get used to it.

So, there you have it. A dozen ways by which different languages distinguish themselves, all of which require balance rather than a direct proclamation of The Right Answer.

[link][no comments]

Yes, the comment box is tiny; write in a real text editor then just cut and paste here.
If you are a human, type the letter h in the first box.
h for human:

14 November 06. IT Policy for Organizations

[PDF version]

This paper will provide a few useful points to the managers who are overseeing the managers of information technology. Its intent is to give those who have not spent their lives reading computer manuals an idea of what options exist for IT organization, and the social and business problems that the technologists must overcome.

Business v academia
One could think of two paradigms in how an IT department is organized. The first is the business-oriented system. At an organization with such an IT department, every desktop has the same software, which is typically installed from a central server at the IT department. The IT department focuses its expertise upon this list of programs. All users must log in to the network, and all network activity is monitored and logged. Security is a very high priority.

One finds the other paradigm at many academic institutions, where students log in with virus-laden laptops that attempt to bring down the network without the student's knowledge. Ornery professors bring their 1985 copy of WordStar from home and insist that the IT department provide support. Network activity is generally logged, but everyone in the computer science department knows how to get around the logging system.

One may expect that the business-oriented IT is far-and-away more stable, but the reality is that the two paradigms are head-to-head. In the last few years, I have worked at two academic departments and two business-oriented organizations. Both academic departments had one (1) full time employee running the entire system, and suffered major failures at the rate of about once per year. Both business-oriented organizations had an IT staff taking up somewhere between one floor and one building. One organization was an IT mess where little worked, and the public-facing web page was filled with technical glitches and broken links. The other organization suffered failures at the rate of about once per month.

Which leads us to the central question of this essay: with so much less effort put into security and stability, why don't universities have significantly more security and stability problems?

The remainder of this essay covers a series of small topics that address this question, including business, social, and technical reasons. The summary: academics keep it simple, in a way that business users typically do not. There are forces, familiar to any businessperson, that push business systems toward complexity; non-IT management needs to guard against them.

I will try to avoid the details of software, but it is worth noting that the business-oriented users tend to run the Windows operating system, while academic-type users tend to run a POSIX system. [UNIX is a trademark of AT&T, so POSIX refers to any UNIX-like system.] This is not a hard-and-fast division, but you will see that each type of software facilitates its matching paradigm.

The division of labor
Corporations are often divided into “Battlin' business units” (as a comic by Scott Adams describes them). Each unit has an internal budget that it hopes to maximize, by billing other departments as much as possible while minimizing the list of tasks with which the department will dirty its hands.

Thank goodness the building services department is not organized like this. Imagine how unpleasant a workplace would be if a department was billed every time a radiator broke in mid-winter. The building services department, knowing that it has a full monopoly on radiator-fixing, could charge what it chose to, and perhaps the local department manager would give up and just leave the radiator broken. Maybe she would buy a few space heaters. Perhaps building services has already stated that radiators are not in the scope of things they are equipped to fix. Providing a vital service via a budget-maximizing, monopolistic department creates abundant opportunity for gaming on behalf of the monopolist.

Rather than allowing building services to define the list of services it will provide, it is typically given a broad mandate: keep the building in good condition. The details of what that means are left to evolve. On the consumption side, no one is ever billed for maintenance services.

This is how the academic IT department works. Their mandate is to keep the network working and the desktop PCs in decent working order. Some departments bill per computer, but this typically means a per-head charge rather than a per-service charge.

Conversely, many business-oriented IT departments bill per service or software item used, and carefully select the services they are willing to put on that price list. This is all entirely natural, and is exactly what is expected of them under the paradigm of the budget-maximizing business unit.

Having established that they will bill for services, what will the IT department offer? As a general rule, a more complex service justifies a bigger budget. That is, complexity goes hand-in-hand with budget-maximizing behavior--and complexity is the worst thing one can have in a computing system.

User expectations
The sad truth is that the job title of every office worker may as well be “computer operator,” since almost all of us spend eight hours a day (about 2,000 hours/year) in front of a computer. Yet many complain bitterly when asked by a manager to spend a few hours learning details of the workings of the machine. IT departments and software authors often concur, stating that systems should be designed so that users can be blissfully ignorant of the machines they use day in, day out.

Imagine this in any other context: a truck driver who does not know basic auto maintenance, an airline pilot who doesn't read flight manuals on the presumption that the controls will be entirely intuitive, a jackhammer that anybody can just pick up and use, a librarian who doesn't learn the cataloging system on the presumption that if it's not immediately obvious then the system is broken.

Intuitive is good, and a system that works against intuition needs to be fixed. Casual users (the archetypal Aunt Myrtle) will never see a payoff from hours of training. But the office worker of today is a “power user” by the standards of a decade ago, working at a job that vitally depends upon good software. It is absurd to say that the best tool for such a person is always the one that is most immediately intuitive and requires the least learning on the part of the user.

The presumption that users have the right to be ignorant of computer matters bolsters the Battlin' Business Units division of labor. Users who log in to a business-oriented computer see only those programs that they can use without training. The tools needed to do basic maintenance or adjust their system's configuration are for the most part missing. The expense and effort of training is saved, but in return workers can do less and are dependent upon IT for more. To continue the mechanical metaphor, the oil for the jackhammer is kept in a locked box that only the jackhammer administrator can access. Keeping users ignorant and disempowered means that the IT department will never be obsolete, but it also means that even simple tasks require a call to the IT department.

In the academic approach, IT is more like building services: office workers aren't expected to take out the trash, but they are expected to maintain certain standards of cleanliness. There are usually abundant paper towels and waste bins scattered around to help with this. There is still a division of labor where most of the hard work in upkeep is given to specialists, but every user is expected to maintain partial responsibility and is given the tools to execute that responsibility. In the IT context, that means users are expected to have some level of training in maintaining the tools that they use every day.

The academic IT department aims to minimize users' dependence on support services. Typically, the IT administrator writes up a page explaining basic guidelines for maintaining system health, and users are expected to put out a reasonable effort to follow them. When major spills occur, the IT department is ready to clean up. Some users never read the instructions and never quite catch on. But they know enough to ask somebody nearby how to clean up their mess, and so a lightweight and decentralized support network evolves.

Security
The basic strategy for secure systems is to keep it simple. A server on the Internet hosts a number of services that wait for data to come in, such as a web server or email server. To oversimplify a complex field: to remotely break your organization's security, an attacker must find a service that is taking in data and then send malicious data to that service.

If no services are listening, then there is no route to attack. So the basic rule of security is to keep it simple: leave open as few services as possible.

Most academic departments closely adhere to the keep-it-simple rule. Their servers and workstations use a POSIX system that allows control over all network services, and leave open only ports for web, email, and a service known as secure shell (SSH).

Here, I am forced to briefly mention the architecture of the Windows operating system. For its operation, it requires a multitude of services that can not be turned off [Windows Time Service, taking input on port 123; Distributed Component Object Model, taking input from port 135; Microsoft Distributed Transaction Coordinator, taking input from port 3372; many more]. Are these services secure? Only Microsoft knows. So, for those who are daunted by the thought of keeping track of ports, services, and sockets, here is a simple summary for basic system security: don't run Windows. Because its complexity involves leaving open ports and services that users can not secure, it fails to follow the basic principle of keeping it simple.

Transparency
Transparency means that you know what code you are running. Some argue that the route to security is to keep the programs you are using confidential--security through obscurity. The absurd name of this technique should tip you off that it is not well-regarded. We must presume that those who hope to break into a system are smart enough to work out such details.

Unfortunately, many software vendors build their business on obscurity, keeping code that interacts with the outside world under lock and key. In such a situation, the IT department is simply handing its security concerns over to an external vendor, and hoping that that company will provide secure products.

Academia, meanwhile, has developed a number of systems that are entirely transparent--notably Apache to serve web pages, Sendmail to send email, and OpenSSH to serve SSH clients. In fact, the majority of the world's web and email (academic and business alike) is served using these open systems.

The summary: keep it simple. Every additional feature, in the operating system, in programs being run, and even in document formats, could potentially be adding a security hole. Which would be easier for an intruder to attack: a word processor document that only supports text, or a word processor document that can include embedded videos and web applications? Vendors and budget-maximizing IT departments press for these features, but in doing so they create potential security issues.

Standards
The reader is well aware that standards are vital for interoperation. For example, the Internet exists because of the wealth of tools that implement the HTML standard in which all web pages are written. But the reader may not know how many sirens call to the IT worker, pleading with him to break the standards.

For any standard, there are things that are difficult to do. Tool providers are well aware of this, and thus provide 100% standards-compliant tools that one could use to write documents that comply with the standard--plus a few nifty features that make things easier. For example, Microsoft provides ActiveX controls that make it easier to write web pages that change based upon conditions such as the user's location or preferences. One could implement a simple web page to get the point across using strict HTML, or use ActiveX to write a page filled with bells and whistles. However, only Microsoft's Internet Explorer (IE) can read ActiveX controls. This is no problem, our programmer reasons, because every Windows computer in the office has IE installed by default.

This is how the siren traps its victims: IE has never existed for a POSIX system, and is no longer supported on Apple systems. Thus, once our poor programmer has a large system in place using ActiveX, it is increasingly costly for the organization to switch to any other system.

The entrapment only gets worse. Let us say that you are confident that your company will never, ever switch away from systems that support ActiveX--that is, Microsoft Windows. Between now and the End of Time, no matter what shows up in the future, you will be a Windows user. Next month, the Microsoft representative will come in to negotiate licenses and upgrades for next year, and since your company has firmly committed in ActiveX code to using Windows until the End of Time, and your company can not operate without information technology, the sales rep can ask any price he wishes. By using ActiveX, you have signed away any and all bargaining power.

The summary: keep it simple. By sticking with standards rather than proprietary extensions, you have a lower risk of creating problems with other users, including both the people in the next department over and the clients who give you money. Standards allow you to keep your options open for future changes in the landscape, rather than wedding your enterprise to a single system.

Politically, the stereotypical businessperson believes in the free market's ability to combine the efforts of free individuals to produce productive and even optimal outcomes, while the stereotypical academic is a socialist who believes in command-and-control central oversight by the well-informed elite. Given these stereotypes, it is ironic that the typical business IT system is a command-and-control system, and the typical academic IT system is a free-for-all.

The academic IT department makes promises like a laissez-faire government: we will make sure that the infrastructure you need--working computers and a working network--are in place and secure. If we have time, we will try to help you with your individual issues. Run whatever software you want, but you should be aware of the risks you take.

The business IT department expands upon this base significantly: we will also watch your behavior on the system, handle all underlying issues of computing so users can remain ignorant of the systems that they work with eight hours a day, and enforce our security policy by making sure that your computer runs only a limited range of programs. We will confer with our vendors to determine what new features users will receive.

They would do better to keep it simple. Every additional promise and demand by the business IT department is one more potential security flaw and one more moving part waiting to break. The reasons for the over-engineering are typical: the IT department wants to maximize its budget, every new bell and whistle is a justification for more funding, and vendors who know the company already has word processors and spreadsheets press it to buy new ones anyway. Empowered users may compete with the centralized IT authority, and so tools are kept out of their hands--but supporting disempowered users requires greater complexity.

The academic approach depends on putting a modest amount of work in the hands of the users, giving them the responsibility of learning about and caring for their most essential tools, and then keeping things as simple as possible in the server room. Some especially inept users get to know the help desk very well, most just go about their business, and the entire system runs with a fraction of the oversight and costs of more command-and-control structures.

[link][no comments]


18 March 07. Web 2.1

[PDF version]

Some people, expecting a different format in a blog, wonder why all my entries are six page essays. Why don't I ever just talk about what I had for lunch, like a proper blog?

But, in fact, I do--on the RSS feed. If you don't know what RSS is, well, it's time; go check Wikipedia.

We can roughly divide the world's web pages into those that are single-topic and probably not frequently updated, like a course web site or a page on wicker collectibles, and those that are updated on a regular basis, like blogs, The Onion, and the New York Times.

The regularly-updated pages have more-or-less converged to an agreement about look and functionality. They have a three-column layout; the entries sit in the center column, which is usually 200 screens tall and divided into entries, each with a button to click through to its own page. You can comment, click various buttons regarding social networking, and so on. You know the drill very well.

A telephone with an RSS button.
Figure One: Yes, that is an RSS button on a telephone.

Let us ask the ten-year question: ten years from now, will people be doing it this way? From my perspective, the answer is a solid no. As in Figure One, there are many other ways to get a single blob of oft-updated information than via a web page in a web browser on a desktop computer. I personally spend most of my time in an RSS reader, and only go to other pages when I need information about the Gamma distribution or wicker collectibles. By the way, I get the impression that most of the individually-run sites about such things have basically been eaten by Wikipedia, for better or for worse. For most of the regular-update sites I read every day, I don't even know what they look like any more.

It's all about that blob of new information. The graphics are nice, but not superurgent. The Digg This! button is frankly more for the benefit of the author, not the reader. As for the comments, I'll get to those below. The link list, search box, and a few other details are nice, but 90% of the time I just want that single new blob of information. So how can that blob be delivered?

RSS is perfect for blob delivery, which is why it is gaining in popularity. And with RSS comes the realization that it doesn't have to be in a web page on a web browser on a desktop computer.

I'm enamored of Internet appliances, not in the sense of smallish computers that read web pages on web browsers, but things like voice-over-IP telephones. Mr DRC of Indianapolis, IN can't stop raving about a device he plugs into his stereo that connects to his wireless network and streams Internet radio stations. During the whole Net Neutrality thing, the anti-neutrality side kept talking about an Internet heart rate monitor that would constantly communicate with the hospital. I don't know if such a thing exists (and if it does, it's certainly not going to need major bandwidth or bandwidth priority), but it's a great example of what could be done.

And yes, I do often read RSS feeds on my telephone.

So the thing that bothers me about the Web 2.0 mini-revolution is the Web part. The fundamental premise of the question “How can we make our web pages more interactive, useful, and fun?” is that there is a web browser involved, and there really doesn't have to be.

So, I've started posting entries to this page's RSS feed with no web page attached. Each such entry is a single blob of information, with no search box, no picture of my beaming face, no linkroll. That's all on this page whenever you need it. The longer essays, which aren't so easy to read on a telephone, are still here in big print so you don't strain your eyes, or PDF format for reading in the bathroom.

Mr AF of Washington, DC points out that there's nowhere to comment on my obnoxious ranting in an RSS feed. But comments are also small blobs of information, so they would fit comfortably on a feed as well. You can post them on your own feeds, or appropriate mechanisms for submitting to somebody else's feed will no doubt avail themselves--there are already many web sites with comment RSS feeds. I.e., it'll work itself out.

Summary: If we're still using a browser ten years from now, it won't look like it does today, and it won't be a central part of our lives, because we'll have a dozen other ways to get small blobs of information from both friends and newspapers. Also, if you haven't subscribed to this page's RSS feed, you'll never know what I had for lunch today.

Having completed that thought, let me swing to a brief editorial on the use of Extensible Markup Language, XML. You see those three letters often, and they are often given Messianic attributes, so let me take a minute and discuss the format in further detail.

XML is intended to intersperse data with metadata, such as specifying that the text “Web 2.1” is a title, and the text “Eric B Blair” is an author. Are there other ways to do this? Sure. A myriad. XML just happens to be relatively easy for a computer to read and write.
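A sketch of what that interspersing looks like--the tag names here are made up for illustration, not from any real schema:

```xml
<!-- Data wrapped in metadata: the tags say what each piece of text is. -->
<entry>
  <title>Web 2.1</title>
  <author>Eric B Blair</author>
  <date>18 March 07</date>
</entry>
```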

Incidentally, many alternatives are easier for humans to write, because XML requires a lot of redundancy and odd situations where you need to replace < with &lt; and so on. The XML validation process is hard. Me, I write everything in LaTeX and use latex2html to produce the web page. This has proven to be much more human-friendly, plus you get the PDF version for free.

An XML document really has two parts: the document itself, and a document type definition (DTD), which is basically a table listing which tags are valid. But even this is not sufficient to parse a document. Now that I know that xls:wkdy or whatever means that the forthcoming number from zero to six is a weekday in a spreadsheet, what do I do with it? Is zero Monday, Sunday, or Thursday?

That is, the hard part of developing a standard is not the part about reading and writing text, but the information itself. As for the weekdays, Microsoft has famously botched this one; scroll down to their zany specification in the blockquote on this OpenOffice-oriented blog.

All of which is to say that the development of interesting standards is great, because it will lead to lots of fun Internet appliances where the sender and receiver have both agreed on a set of metadata tags and how to handle data in each given format. But when somebody says “We are using XML, so we're standards-compliant,” you can safely ignore them--they're writing data down right but have said nothing about the hard parts of writing a standard.

[link][a comment]

on Monday, March 19th, spoofy said

I just sent this page to a delighted coworker.. "RSS feeds, what are they?" And minutes later, upon explanation, "Why isn't everything available via RSS?" Oh technology...the infinite wonder... the reverse solidus.


29 October 07. Suggestions for the ISO/IEC C committee

[PDF version]

I'm the most fanatical advocate of the C programming language I know. I wrote a darn book about how it should be applied to places where scripting languages are more commonly used.

So, as one who has written way too much on why everybody else's language sucks, here are some notes on how I think C could be improved. Beyond a few security tweaks, there isn't really any pressing need to evolve C beyond the 1999 standard, but here's my wish list anyway, in case another round of reform should ever happen. As above, it's primarily based on a desire to learn from some scripting languages regarding syntactic conveniences that are almost implementable in C, but not quite. And as a wish list, I have no idea what debates have been had before about these issues or the counterarguments about why these will cause havoc (and the ISO/IEC doesn't post rationales for past rejections anywhere that I could find them). I just want them.

The reader will note that none of these suggestions will require rewriting existing code, and none of them require a serious shift from what a C compiler can do now (save for maybe the thing about counting arguments in variadic functions, which is a frickin' security risk and needs to be addressed anyway). The first half are simple wish list items; the second half is about the form of variadic functions, which shows such potential but in its current form is broken.

asprintf needs to be part of the standard
The problem is that any string operation is three steps: measure the length the string will be, resize the string, then write the data you just measured into the string. You can home-brew convenience functions to do these three steps for most operations like copying or concatenation, but for snprintf, it is basically impossible to do. Everybody keeps reinventing asprintf (GCC and BSD both have it), and it is so much more convenient than snprintf that I consider that function to be basically obsolete. Because string length bugs are such a hacker favorite, the missing asprintf is also a security concern.

-> versus .
I don't remember who pointed this out, but there's no difference between these two. If you use parent.child when you meant parent->child, or vice versa, you just get an error and have to go back and switch to the other one. There is no ambiguity: the syntax should use a period for both. This is already how function pointers are handled, because if you call a pointer-to-function when you meant to call the function itself, the system can do what you unambiguously mean without being all pedantic and demanding that you fix the syntax.

Division by int
In many programming languages, 8/5=1. This is part of the conflict between computer scientists, who really want division by two ints to always return an int, and we humans who expect 8/5=1.6. This should be an optional compiler warning that humans can turn on as desired. The compiler can even do one of those polite things like `suggest parentheses around int divided by int' so when you really need int/int you can have it. [So no, the ISO/IEC C folks don't have to get involved.]

Nested functions
All those guys who brag endlessly about how wonderful LISP-type languages can be are really just saying that it's great to have lightweight inline functions. And hey, it is. Maybe they're also bragging about how far you can get without using any state variables, but you can write in a state-variable-free style in C too. The typical modern compiler will even recognize tail recursion and compile accordingly.

GCC already supports nested functions, and it seems like it'd be harder to not support them. After all, the placement of a function is really just a question of what variables are in scope. We already know how to deal with scope where one block is inside another block (like a for loop inside a for loop), so it's a simple application of the block-in-block scoping rules to have functions-in-functions. This made so much sense to me that I didn't even know nested functions aren't part of the standard until somebody told me my code won't compile on MSFT compilers.

As for proper in-line functions, they could maybe be declared in the format used for anonymous structs below, but I don't know if there'd be much benefit. It's already hard to read when people in those super-elegant languages put an anonymous function in the middle of a lengthy function call, and a single function header is blockier in C, what with all those types. Nested functions, such that you can declare the function on one line and use it on the next, will suffice.

A smarter preprocessor
The preprocessor gives you one form for macros: newfunc(a, b, c). We can use regular expressions to do better. For example, some people really prefer 2Darray[i,j] over 2Darray[i][j]--even the first edition of K&R acknowledges this. It'd be pretty simple to specify that the preprocessor, whose sole job is text substitutions, be able to substitute one for the other.

Two other examples I'd immediately implement if somebody gave me the chance: v[[i]] to gsl_vector_get(v, i), and object.process(input) to process(object, input). That second form allows you to have a self or this variable passed to functions inside structs, but without having to deal with the other sixty-one (61) keywords of C++. I've discussed this before. All you need is a simple text substitution, but the preprocessor won't even give us that much.

Better variadic functions: default values

You could almost do this now with an inline struct. Here is code which would be valid if it were in a real program. Instead of passing individual arguments, it passes a single ad hoc struct that holds all the function arguments. Because you can send an anonymous inline struct whose values are set via designated initializers, and whose undeclared values are set to zero, you can simulate named arguments with default values.

typedef struct{
    int top, bottom;
    char *left, *right;
} ad_hoc_struct;

int call_me(ad_hoc_struct in){
    if (!in.top)    in.top = 287;
    if (!in.bottom) in.bottom = 34;
    /* ... do the real work ... */
}

call_me((ad_hoc_struct) {.top=12, .left="Hi there"});

Of course, this is a syntactic mess, but the fact that you can do this via standard C shows that we're not talking about completely gutting the language, just major cleanup on form.

To outline the work the system would need to do: if there's an = in the function declaration, the compiler would need to declare a struct with the same scope as the function, and would need to parse the call through the struct. Because it is currently not valid to have an = in a function declaration, this does not interfere with existing code.

I bet somebody could write this as a preprocessor script. That would be neat.

Better variadic functions: argument count
My impression is that the variadic function setup is exactly sufficient to implement printf, and no more. I feel like I'm on a tightrope without a net every time I use a variadic function because there is no way to know the type or number of arguments. The system knows at compilation, but the standard obligates it to just throw that info out.

For the number of arguments, it'd be cheap and easy for the system to simply count them and send in a phantom variable with a name like _nargs (or such) that indicates the count. This would be a cheap way to implement default values as well: in the example above, if _nargs < 2, then let bottom=34, or such.

Oh, and it's entirely '70s that we need at least one fixed-type argument (once again exactly what you need for printf, but annoying in many other contexts). That'd be obsolete if we got the number of args sent in.

Better variadic functions: type safety
Also, variadic functions throw the type checking mechanism out the window. For the printf family, there's a sufficiently set means of dealing with the variable arguments that GCC can check up on you, but for anything else, you're screwed. If you have a function that takes in two long doubles, and you make a call like doublevariadic(in, 1, 0), then you're screwed, because the bits that represent the integer 1 look nothing like the bits that represent the long double 1.0. In fact, on a bad day, this will just crash and/or produce a security risk, because reading two long doubles may go past the input data block.

To make this concrete, here's some sample code that crashes and burns. The fix is to replace 2 and 3 with 2. and 3., which shows how subtle the errors invited by variadic functions in their current form can be. Note the incredibly convenient C99 use of a declaration inside a for loop header, indicating that the ISO/IEC standards committee is not insensitive to the human desire for grace and convenience.

#include <stdio.h>
#include <stdarg.h>

void vtest(int ct, ...){
    va_list va;
    va_start(va, ct);
    for (int i=0; i<ct; i++)
        printf("%g\n", va_arg(va, double));
    va_end(va);
}

int main(){
    vtest(4, 1., 2, 3, 4.);
}

This one may be a compiler issue too, because compilers are free to extend function declarations with vendor-specific __attribute__s. For example, GCC lets you check a function with a printf-style format string against subsequent arguments via the format attribute. Similarly, one could imagine a vartypes(char*) attribute that would tell the compiler to check that all variadic arguments be char*s, for example. That would work for the function above, but wouldn't necessarily be helpful if we're expecting, say, a potentially infinite list of int-double* pairs.

Anyway, variadic functions need reform, because they are the most unsafe part of the language. But they also have a lot of potential, because they could be used to implement the sort of pleasantries like default arguments and named arguments that every modern coolio scripting language has.

[link][no comments]


03 September 08. Google OS (aka Chrome)

[PDF version]

OK, Ms ABR of Washington, Columbia asked me to write about Google's new browser, so here goes. I'm typing fast, editing lightly, and posting on an odd-numbered day.

Google's browser is an attempt to shift the position of a long-running search for balance, over where work is to be done. So this discussion of the browser has to start with a brief history of networked computing.

We begin with your mainframes of old, like before we were born old, which often had terminals attached. Terminals, like the terminus of a railroad station, were the end of a line out of the central system, where the end in this case has a screen and a keyboard attached. You would send requests from your little end of the line, they would go to the mainframe, and then it would send results back down the line. Thus, these terminals were called dummy terminals, because they did no thinking, just relaying keyboard presses and displaying the output.

This is why the personal computer revolution was so interesting: you now had terminals that looked like dummy terminals (like the TRS-80) that were capable of doing things on their end of the line. So home users, who had no mainframe to attach to, were increasingly using these little terminals to do independent work that the dummies could never do.

Now, put a mainframe capable of math on one end, and a terminal capable of doing math on the other. The key question for the rest of the essay: who does the processing?

To make this more concrete, jump forward to the Internet age. You type in a web address, and the server sends back a big block of text. That's dummy terminal mode, where your computer is doing minimal thinking. Now say you go to a site with silly Flash or Java games. You go to the site, you get a bar that says `loading' on the screen for a minute, and then you play your game on your screen, without really talking to the server. Now things are reversed: the server just read your request and dumped back data, and your PC does all the work.

Or say you go to Gmail. It has a `loading' bar like a Flash game. But the server is active, because it's trying to find your new mail, starred mail, spam count, and so on. But your PC is active, because it's opening and closing window bits without talking to the server, autocompleting and highlighting things when your mouse is in just the right place, and so on. There's a sweet spot between work on the server side and work on the client side; a lot of people think Google has hit it. No citations today. But try typing 'Google sweet spot' into, uh, a search engine. Me, I think Google has missed it: my email should not need a `loading' bar, but that's just opinion.

Virtual machines
Why not have the client do everything? That's the clear trend, but it's been tried before, and past efforts were not as victorious as hoped. Recall Java, which emerged with much hype in the mid-1990s as the way to get networked computing onto our increasingly smart client PCs. In retrospect, we can see Java's failings pretty clearly. First and of least importance, it emerged in the middle of the object-oriented fad of computing, and the language itself went way overboard.

Second, it relied on a virtual machine (VM) that never ran as well as we'd have hoped. Sun promised to write a VM for any device (telephone, Windows box, Linux box) that would handle the guts and details, and then you'd write a program in one language--Java--that runs on all these machines. But the VMs were all a little different: at the least, your telephone has buttons that your PC doesn't and vice versa, so how do you write something that works in both places? But the big virtual machine difference was between Sun's virtual machine and Microsoft's. Microsoft's Java machine was designed to be incompatible, as recorded in a ton of court documents. You'll recall the press about the Microsoft antitrust case, which was mostly about Microsoft killing the Netscape browser, but the real crux of the case was about how the browser carried a Java VM, and Microsoft felt it important to kill the VM.

So once you download a Java program, it might not run. Running from a virtual machine instead of native to the hardware, it might run slowly. And finally, there was the downloading issue: a Java program is too much for a guy living in 1995 with spotty AOL dialup to use without frustration.

But the virtual machine idea was a good one. It's a fabulously attractive idea to have a code-running box that manages all the low-level work, so programmers can do the high-level stuff. It's so fabulous that Microsoft does it: their .NET framework basically allows you to write in any language, then translate it to their .NET machinery to run on a Windows box. This is exactly the abstraction Java did, but .NET is written around Windows machines. The virtual machine idea predates Java. Infocom games, like Zork or the Hitchhiker's Guide to the Galaxy, were data files for a virtual machine. The Infocom VM is easy to rewrite; I could get one for my telephone.

Your browser is a virtual machine. Every browser can read JavaScript (whose code has no discernible relationship to Java--the naming similarity is pure advertising), and can run Flash, and load Java programs. That's why Google's mail program can run on basically any machine as long as you have a browser to interpret its JavaScript.

The family tree
One of my favorite things about how modern computers work is the fork/exec model. I won't bother with details, but programs can start other programs. Every process has a parent process (unless the parent died, in which case it's an orphan), and no program can spawn out of nowhere: it needs a parent. This is how the entire thing works, from boot to shutdown: you start with init, then init forks off a new program, say the bash shell. Then the bash shell forks off a browser when you type firefox at the command prompt. Then you open a lot of tabs in Firefox.

The process model gives you stability, because the children are only vaguely related to their parents (mostly via carefully-controlled interprocess communication), and if the parent has issues, then they won't affect the child, and vice versa. It's the operating system's job to make sure that this is the case, and to make sure that the processor gives fair time to every process running, where by `fair time' I mean access to the hard drive, the processor, and other physical resources the OS is taking care of.

So back to Firefox, which does not spawn child processes (to speak of). It's one monolithic blob to the operating system, not a family, so, e.g., if one blob of Javascript fails in one place, then all the others will also be stuck.

Google Chrome is prolific: it is designed to spawn lots of children. For every web page you have open, you should have a separate process. So let's review: you have a Javascript program (aka a web page) in one tab, and that tab is its own process that the operating system treats equally to every other program. Yup, sounds like a standalone virtual machine to me, exactly like the Java VM or Microsoft's .NET.

So Google has taken those last steps to make our typical programming languages of the Web exactly the languages you need to write standalone programs for any operating system. With a few lines of Javascript and HTML, you can write and distribute a standalone Windows program.

Or to put it more directly: the operating system now gives equal treatment to Google Docs and Microsoft Office.

Critique and politics
The Google VM will definitely benefit Google: they've got the lead in programmers who speak the language that their VM speaks. Does that make their browser evil? Maybe, but as evil goes, this is pretty beneficial to everybody (except Microsoft), because another VM choice may allow some fun new applications.

In fact, Google has made their code available under a relatively corporate-friendly member of the family of free software licenses (BSD). Why? Because they don't care about vending VMs, they want to make sure that absolutely everybody has such a VM, so that it's feasible to write for the Google VM rather than for .NET or whatever other toolkits might be hanging around. How getting people to choose Javascript over .NET will turn into $$$ for Google is left as an exercise for the reader.

Oh, here's one hint (along one of several threads): go back to the problem of balancing work on the client and server ends of the cable. If Google gives you software that grabs more processor time on your PC for Google Docs, then it can redesign things so that its servers in California don't have to think so much. Google doesn't have to spend cash on new servers--they just use more processor time on your PC. Google is thinking maybe you can pay the darn electricity bill for once.

Further, mainframes are not particularly smart. From my own experience buying servers for research, the big boxes are designed to push lots of data through a pipe, hold a big database, mesh together into an army of servers, and otherwise handle lots of little requests. But the processor on some servers is identical to the processor on a high-end PC, and ten cheap PCs would easily run circles around one blade of a server. So the only way that Google could feasibly make a million instances of Docs smarter is to push work out to the clients.

As a digression, all this processor-seeking touches upon one of my personal pet peeves: VMs are slow. As I type, I'm waiting for Amarok to add an album to my playlist. This is not something that should require waiting for (op--it's done), but Amarok is written in Ruby, which allowed for all sorts of nifty widgets that would take longer to write from scratch. Hey, just click a performer and pull up their Wikipedia page in your music player, all while you're waiting for their music to actually play. So I'm not sure if we can expect too much richness from Google's new virtual machine, though maybe for once the promises that it'll be better with next year's faster processors will actually come true.

But that's all the critique I've got. Google has taken that last step to turn the web pages of the type in which they specialize into bona fide applications that the operating system treats as such. That's nifty, and means that we can expect our web pages to turn increasingly complex and to increasingly take advantage of the processing power on our end of the cable.

[link][2 comments]

on Thursday, September 4th, AB said

Well thanks for this post Mr. Blair. I do have a few questions: If Chrome really does slow down people's home computers, won't there be a public outcry (and eventual disuse)? Also: Has anyone else heard of the IP issues that seem to be surrounding Chrome?

on Thursday, September 4th, the author said

--Your processor has 100% time to allocate to everything running. Before, if you had Word and Firefox running two tabs, then Word gets 50% time, Firefox tab 1 gets 25% time, and Firefox tab 2 gets 25% (very, very roughly). If they're all separate processes, each gets 33% time. Overload with the first setup just pushes Firefox to crawl; overload in the second case slows Word down too. [very, very roughly.]

If you care about the performance of those two tabs running Google Calc and Google Reader over Word's performance, then great, you like the additional boost. If you care about Word's performance, have enough tabs open, are sensitive enough to performance issues to notice, the "very, very roughly" part doesn't have too big of an impact, resource allocation is sufficiently consistent that there's a pattern, and you catch that pattern, then maybe you'll be annoyed that the new browser takes more resources from other standalone programs.

--There was a bit of an outcry over some boilerplate language in the user license that all the data that passes through Chrome is the property of Google, we own your passwords, and so on. They retracted it pretty much the next day, and I'm willing to believe that it was really just lawyers overcovering their asses, and not really what Google was after. Anyway, this is on a BSD license, you can pick through for any monitoring code and see how Google is watching, if at all.
