Archive for October, 2008

on Perl

October 7th, 2008

I discovered Perl in the old millennium. I was building websites and I found out that HTML isn't Turing complete. Content and layout were cool, but it didn't really *do* anything. That's when I heard about CGI. I would find Perl scripts on sites that no longer exist, upload them, get the 500 error, stare at the logs, revert any changes I had made, and try again. It wasn't sophisticated, and I wasn't even writing any code; all I wanted was to get the damn thing to work. The fact that Perl was the language made no difference one way or the other. It was packed with frustration, though.

The principle of surprise

I've written code in Bash, C, C++, Haskell, Java, Pascal, PHP, Python, and Ruby. So I feel like I've been around the block a few times, as far as choosing a language goes. And yet, Perl leaves me bewildered. One of the pillars of Ruby is something called "the principle of least surprise". What it means is that when you're not sure how to do something in Ruby, and you just do what seems most likely to work, it works. It's a wonderful quality, and it seems to be a direct reaction to Perl, because Perl is the exact opposite.

Perl smacks horribly of apprenticeship culture. One where the novice is carefully guided through the valley of death, across the bridge over the pit of lava, past the nine-headed monsters, by a veteran monk. Send a tourist out there with a map and he's likely to be sent home in several pieces. Take a look at a common mistakes document and you find an immensity of pitfalls, not just in the language itself, but also from version to version.

This quality isn't merely at the fringes, it permeates the language as a whole. Take arrays, a fundamental type in any language. Say @arr holds ("Larry", "Wall"). Type print @arr and you get LarryWall, the elements printed back to back. Type print @arr."\n" and you get 2<newline>. Que? Put an array in a context where a scalar is expected and you get the length of the array. Better yet, call a function like this: func(@arr, $arg3). Guess how many arguments the function receives? Three. The array has been flattened and concatenated with $arg3. Yep, auto-expanding arrays, ain't it grand? I like automation, but this is too automatic.
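A minimal demo of these context games, using the two-element array from above:

my @arr = ("Larry", "Wall");

print @arr, "\n";                    # list context: LarryWall
print @arr . "\n";                   # . forces scalar context: 2

sub count_args { return scalar @_ }  # how many arguments arrived?
print count_args(@arr, "x"), "\n";   # 3 -- @arr flattened into the list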

It's very common to get an integer where you expect to have a string. One common Perl function is chomp, which removes the trailing newline. I kept getting an integer until I figured out it's a destructive function: it modifies its argument in place, and what it returns is the number of characters removed. This is another weird quality for a scripting language, lots of destructive functions that give integers as return values, as if you're doing syscalls in C.
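To illustrate the trap:

my $line = "some input\n";

chomp($line);              # in place: strips the newline from $line
print $line, "\n";         # "some input"

my $oops = chomp($line);   # the mistake: $oops is now 0, the number of
print $oops, "\n";         # characters this chomp removed -- not the string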

I was stuck on one problem for hours because the value I was printing never made it to the screen. It turns out it was because of line buffering. So how do you flush the buffer? I looked for a flush function; couldn't find one. Nah, you set the variable $|. Should have been obvious, but it wasn't.
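For the record, here's the incantation, plus the less cryptic spelling I found later via IO::Handle:

$| = 1;               # autoflush the currently selected filehandle (STDOUT)
print "working...";   # now shows up immediately, no newline required

use IO::Handle;       # same effect, readable version
STDOUT->autoflush(1);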

Here's another one. Say you set a variable in a config module and read it in a code module. Now you want to rename it; how many places do you have to do that in? The answer is four. 1) the declaration, 2) the use site, 3) the explicit export in the config module with Exporter, 4) the explicit import in the code module. This is ridiculous.
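Here's a sketch of all four places (the module and variable names are made up):

# MyConfig.pm
package MyConfig;
use base 'Exporter';
our @EXPORT_OK = qw($timeout);   # 3) export it explicitly
our $timeout = 30;               # 1) declare it
1;

# main.pl
use MyConfig qw($timeout);       # 4) import it explicitly
print "timeout is $timeout\n";   # 2) use it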

A bag of hacks

When it comes right down to it, Perl is an amalgamation of hacks into a big, messy package. Granted, many of the syntactic hacks are useful, like qw(). Others are completely incomprehensible, like all the variables called $<nonalpha>. These all mean something: $`, $' and $+ are bound up with regular expressions, $- with output formats. $_, though, is used all over Perl to store "the thing you may want to have".

Despite all this syntax, there's none for declaring formal parameters to a function. It's just like Bash: pass in whatever you want, then read it out of an array. One common idiom is my ($var1, $var2) = @_. If you only have one variable you might be tempted to drop the parentheses and write my $var = @_, which will give you a fun bug (an integer), because now you're assigning an array to a scalar instead of doing a list assignment, and a scalar gets the array's length.
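In code, the idiom and the bug side by side:

sub greet {
    my ($name) = @_;          # list assignment: $name gets the first argument
    print "hello, $name\n";
}

sub greet_broken {
    my $name = @_;            # scalar assignment: $name gets the LENGTH of @_
    print "hello, $name\n";
}

greet("world");               # hello, world
greet_broken("world");        # hello, 1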

Most (modern) languages have set a sane policy on pass-by-value vs pass-by-reference. And there are those with a foot in each camp, making the coder constantly second-guess himself (thank you, C++). Perl, predictably, is conflicted about the issue. By and large you pass values around like you're in Bash, but pointers/references exist too. Here's a motivating example: how do you pass two arrays to a function? Derm. You.. can't. When they come out the other end, the arrays have auto-flattened into one big array. So here's how: func(\@arr1, \@arr2). And in the function you say my ($arr1, $arr2) = @_. What the? Bear with me. Now you have two references. To get an element back out you write $$arr1[2], or with the arrow, $arr1->[2]. Same goes for %$ (hash reference) and &$ (function/closure reference). You sort of "wrap" the sigil of the type being pointed to around the pointer. Needless to say, this syntax debauchery doesn't make code simpler.
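Putting that together, a small working example:

sub show {
    my ($arr1, $arr2) = @_;        # two array references, not flattened
    print scalar(@$arr1), " and ", scalar(@$arr2), " elements\n";
    print "third of the first: ", $arr1->[2], "\n";
}

my @a = (1, 2, 3);
my @b = (4, 5);
show(\@a, \@b);                    # pass references, arrays stay intact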

It gets better. Perl's support for complex data structures is really... interesting. I was messing with an array of hashes once, and I needed to sort the records by a key in the hash. (Picture a table where you click on the column name to sort by it, that sort of thing.) This might sound lame, but it's not obvious how to do it. I spent hours on Google and found all sorts of examples, averaging about 15 lines. 15 lines to sort an array of hashes? What is this, C? I didn't understand them, so I kept looking. Eventually I found an academic-looking paper about the issue. It demoed various approaches, concluding with a 4-liner that was supposed to be the best, hurrah! The code is so incredible that I have to show you.

my @sorted =
	map $_->[0] =>
	reverse sort { $a->[1] cmp $b->[1] }
	map [ $_, pack('C1' =>
		$_->{"priority"}) ] => @unsorted;

Let me try to unobfuscate this. There's an array of hashes and we're sorting by the key called "priority", an integer value. Reading from the bottom up: the inner map takes every hash and builds a pair [hash, key], where the key is the "priority" value packed into a string (wtf). So what have we? A list of hash/string pairs. sort then compares the packed strings lexically, and reverse flips the result into descending order. The outer map throws away the strings and keeps just the hashes, now in sorted order. I think. But imagine: convert all the keys you want to sort by to a frickin string, sort the strings lexically, then figure out which hash every string belonged to and order the hashes accordingly. It's madness.

(Disclaimer: I was a total Perl noob when I was looking for this code, so maybe I missed something, maybe I explained it wrong.)
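For what it's worth, the pedestrian version I eventually learned does the job in one line, assuming "priority" holds plain numbers; as far as I can tell, the pack dance above is a performance trick for big lists, not a requirement:

# descending sort on the numeric "priority" key
my @sorted = sort { $b->{priority} <=> $a->{priority} } @unsorted;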

Somewhere in this madness someone decided to introduce a little order. It's become a standard for Perl code to put this verse in your module: use strict. What it does is make your code no longer run (presumably the reason it's not on by default). It adds some static checking, so misspelled variable names no longer fail silently, and that sort of thing.
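For instance, with a typo in a variable name:

use strict;
use warnings;

my $count = 3;
print $cuont + 1, "\n";   # typo! under strict this dies at compile time:
                          # Global symbol "$cuont" requires explicit package name
                          # without strict it would silently print 1 (undef + 1)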

Maddening syntax

Perl not only has syntax for "useful things", it has syntax everywhere. Like in PHP, the $ sign must accompany every variable, always. Except if it's an array, then you use my @arr = ("elem1", "elem2"), and my %hash = (key => "value") for hashes. But then again, when you access an array element you write $arr[1], and the same goes for hashes. So what the heck? You can also declare a hash through a reference, which goes my $data = {key => "value"}. More syntax, more fun, right? Basically, it's @ for arrays, % for hashes, & for functions, and $ for... everything else.
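The whole sigil menagerie in one place:

my @arr  = ("elem1", "elem2");   # a whole array: @
my %hash = (key => "value");     # a whole hash: %
print $arr[1], "\n";             # one element: $ plus [index]
print $hash{key}, "\n";          # one value: $ plus {key}

my $data = { key => "value" };   # a hash reference is a plain scalar
print $data->{key}, "\n";        # dereferenced with the arrow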

And, of course, always remember my, because Perl thinks all variables should be global by default (wtf?).

I swear my fingers are more tired from Perl than from any other language. There is so much syntax to type. At least it's tolerable when you have vim's tab completion set up. It's not "intelligent", so it doesn't filter out suggestions that don't fit the context, but it's good enough that you never have to type an identifier or keyword twice.

So why Perl?

Despite everything, in the end Perl is not such a horrible language. As the saying goes, "you get used to everything", which implies that every language is usable, since you will eventually get used to it. The benchmark, then, should be how long that takes. And Perl is awful at first, a terrible language to "try stuff out" in. Realize you have to change the type of a variable, and thanks to the silly $/@/% syntax you have to run all around your code updating sigils. On the other hand, if you know what you need, then it's not as painful. And as I'm finding out, it doesn't take that long to adapt.

The nice thing about Perl is that it's so close to the shell; a sort of superset of the shell. You have grep built in, you have regular expressions as native as they get, you have various shell commands included, fast access (not necessarily easy access) to sockets, pipes, all the good stuff. It's nice to have thin wrappers around syscalls; sure as hell beats doing it in C. And heck, I like to break lines with . (concat) when printing strings, and because the ; is required I don't have to stress about it.

Beyond that, Perl has its place in the language ecosystem. Learning Perl is a way to understand Ruby, which is based, in part, on Perl. It's also a way to understand C, which Perl inherits from. That isn't to say using $_ in Ruby is a great idea, but now at least I know what it is when I see it.

And you know the blocks/closures that everyone loves about Ruby? They're from Perl. You can't write them as neatly, it is Perl after all, but I do quite like map { /$device on ([^ ]+)/ } $mount_data;

People obviously have their own criteria for picking a language. I've realized that perhaps the most important thing is how the language lets you manage data (or maybe I'm just saying that because I like Python?). Perl is definitely not a great choice here. It has good string handling, but once you get into multidimensional types (and I haven't even mentioned Perl's "object oriented" features) I run screaming back to Python.

I suppose Perl's chaos can in part be excused on historical grounds. After all, modern languages like Python and Ruby that don't have these problems had the benefit of Perl's example. But then again, Perl isn't the only older language still in use, but it does stand out as the most chaotic.

undvd, now in perl!

October 5th, 2008

So it turns out you can do a whole lot with bash. More than I knew. But when you get to the point where you start hitting the limitations of your language*, it gets frustrating. The biggest problem with bash is that it doesn't have real functions. You can wrap a bunch of code and call it with arguments, but it can't return a value. I've tried to come up with a hack to emulate functions returning values, but in the end there just aren't enough pieces in the box to build it from.

To date, undvd has been using various tricks to get around this. Let the function echo the value and capture it in the caller. But then what if you have a failing condition? Well, you can echo the value to stdout and echo the error to stderr, so it doesn't get captured as the result of the function. And then kill $$ to force an exit (you can't just exit, because the captured function runs in a subshell, so exit amounts to a return from the function).

That kind of works, but eventually you break down when you have to return more than one string that may contain whitespace. Sure, you could quote them, let the caller find both strings based on where the quotes are, then chop off the quotes and voila. But all this just for a function call? It's too much, and it's unacceptable from a maintenance standpoint.
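For contrast, this is what the whole circus costs in the language this post is heading toward. Returning two strings from a Perl sub is a non-event (the names here are made up):

sub probe_title {                        # hypothetical
    my $title = "A Title With Spaces";   # whitespace, quotes, no problem
    my $error = "";
    return ($title, $error);             # return a real list
}

my ($title, $error) = probe_title();
die "probe failed: $error\n" if $error;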

Bash's overall weak support for other features of a typical programming language makes it a challenge to write structured programs. undvd-0.6 is therefore pretty much a dead end from a development standpoint. It works well enough, but it's hard to get anything more out of bash. In order to keep evolving, undvd needs a new language.

Another substantial problem with bash is that you're executing commands through the shell; in other words, you're building execution strings. There is a lot of potential for quoting bugs when you're dealing with filenames that have spaces and quotes in them. And not just when feeding them as arguments to executables, but on every "function call" just the same. I've spent a lot of time trying to safeguard against this, but all it takes is one instance where the strings aren't quoted correctly, and you have a fatal parsing error.

So it's time to think about porting to another language. A language that is close to the shell. A language that lets you run a subprocess by passing in the arguments as a list, not a string. A language that has basic programming constructs, like functions. That has good string handling. That can do simple floating point arithmetic. That is as widely available as possible. A language like... perl.
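In particular, Perl's list form of system sidesteps the quoting problem entirely: no shell is involved, each argument arrives at the executable as-is. A sketch (the mencoder flags are only illustrative):

my @args = ("-dvd-device", "/dev/dvd", "-o", "My Movie.avi");

system("mencoder", @args) == 0      # list form: no shell, no word splitting,
    or die "mencoder failed: $?\n"; # spaces in filenames just work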

It sounds absurd, doesn't it? Porting to perl in the name of maintainability. But when you're in bash and most of what you're doing is string manipulation and calls to other executables, it's the right choice. And I bet you have perl on your box.

Not that it hasn't been fun. Bash was the right place to start, and I've learned a lot of things about bash on the way. I've also learned that you have to do obscene things like echo strings to bc to do simple floating point math.

The port

It's a straight port. undvd 0.7 runs on perl, but it was written to reproduce 0.6 exactly. The code is completely new, obviously, but the functionality is the same.

As a result of running in perl, all the string/numerical processing logic has been internalized, and all the calls to awk, sed, bc and so on are gone. This makes it run faster; scandvd especially is noticeably quicker. It's not a big impact, since most of the work is done in mencoder, and that is still the same. Nevertheless, it's a welcome side effect. It also makes me happier, since undvd is now less dependent on all these outside tools.

In terms of size the code is about the same. The perl code is actually 5% bigger.

What this means for you

  • 0.6 and 0.7 are functionally equivalent.
  • If you find a bug in 0.6, it's probably also in 0.7.
  • If you find a bug in 0.7, try 0.6.
  • Please report bugs.

* I'm using the term "language" loosely here. I'm talking about the language, the implementation, and the execution environment (i.e. standard libraries, or in bash's case the gnu userland). Often we just pile all of this under "language", because it's easier to talk about it that way.

showing equations is not teaching

October 3rd, 2008

I'm going to describe something that you know very well, and that you do all the time. I'll describe it algebraically, so that we can keep it somewhat rigorous, like good teaching prescribes. Once I'm done, you'll know exactly what I'm talking about.

  1. c <= C
  2. w = c, 0 < w < V
  3. 0 < d < v
  4. p = {p1, p2, ..., pn}
  5. t = px

Got it? It has a colloquial name: doing laundry. Here's the same thing in words.

  1. grab a subset of the clothes in the laundry basket/hamper
  2. contents of washing machine equal to said clothes, but greater than zero, less than washing machine's volume
  3. contents of detergent compartment greater than zero, less than its volume
  4. machine has a set of programs
  5. duration of wash determined by chosen program

Here's the thing. If you understand laundry, and you know that's what the equations are supposed to describe, you could probably figure out what's what. At the very least, you could come up with your own set of equations, and they might be similar enough to infer the original meaning.

But what if you had never heard of laundry, and all you got were these equations? Could you figure it out? No. You're just not that clever.

Now put yourself in the shoes of someone who's teaching laundry. You know laundry inside out, you can derive the equations at will. Laundry is the most obvious and trivial subject as far as you're concerned. Students come to your class, today's topic is laundry. You spend a couple sentences describing laundry. You explain it in words that your students don't understand. Then you present the equations. Then you go to lunch feeling good about yourself, passing on the knowledge and all that.

As it happens, not all the students latched onto the theory of laundry. Some are turning up, asking dumb questions. What is wrong with these people? How can you fail to understand laundry? You'd have to be dense. Geez, the quality of our freshmen really is plummeting. There's no way my generation was so thick.

dynamic or lexical, that is the scope

October 2nd, 2008

Apologies for the awful pun on a 16th century action movie.

Do you know how in the movies, when someone has to testify, they first pin his hand on a Bible and make him recite the I swear to tell the truth, the whole truth, and nothing but the truth, so help me God litany? Presumably, the god they're talking about is the god in the book; that's why the book is there (I bet polytheists find this very helpful). I guess they think it's harder for people to lie after taking a pledge while handling a Bible. (Do we have any statistics on whether that works?)

Anyway, in a dynamic scope, there is a witness called Python. He will make his pledge based on whatever book they happened to shove under his hand that day. One day it could be the Bible, a week later it could be The Gospel of the Flying Spaghetti Monster. So the pledge will always be relative to the god in that particular book. Uneasy about one god, very comfortable with another one.

In a lexical scope, there is a witness called Perl. He is very emotional about his first time as a witness. And even though they give him a new book every time, he just can't seem to notice. He makes his pledge based on the very first book they slipped him.

And now for a short digression into the world of programming languages. You have two scopes, one nested inside the other. There is a variable used in the inner scope, but bound in the outer scope. How do you evaluate this variable? There are two answers to that question.

Under dynamic scoping the variable gets the value that it has in the outer scope at the time of evaluation. Under lexical scoping the variable gets the value that it has in the outer scope at the time of declaration.

That didn't explain anything, did it? I know, read on.

Who cares?

This is an important question, and people rarely seem to ask it. Functions care. Named functions and unnamed functions like lambdas, blocks, closures (different languages have different names for the same thing). Anything that has a scope and can be passed around, so that's only functions.

So all the blah about lexical scoping really just boils down to one little detail that has to do with how functions are evaluated. Hardly seems worth the effort, does it?

Dynamic, baby

Dynamic scoping is the more intuitive one. You grab the value of the variable that it has when you need to use it. That's what's dynamic about it, today this value, tomorrow another.

Consider this Python program. It prints the same string in three different colors. output is the function responsible for the actual printing. output has an inner helper function colorize, which performs the markup with shell escapes. Now, since colorize is defined in the scope of output, we can just reuse those bindings. I pass the string explicitly, but I don't bother passing the color index variable. (A variable gets interpolated wherever there is a %s.)

def output(color_id, s):

    def colorize(s):
        return "\033[1;3%sm%s\033[0m" % (color_id, s)

    print colorize(s)


for e in range(1, 4):
    output(e, "sudo make me a sandwich")

Lexical ftw

Lexiwhat? If you recall from last time, "lexical" is a pretentious way of saying "where it was written".

What this implies is that the outer binding is evaluated only the first time. After that, whatever scope the function finds itself being evaluated in, it doesn't matter, the variable with an outer binding doesn't change value.

Consider this Perl code, which does exactly the same thing as the Python.

sub output {
    my ($color_id, $s) = @_;

    sub colorize {
        my ($s) = @_;
        return "\033[1;3${color_id}m${s}\033[0m";
    }

    print colorize("$s\n");
}


for (my $e = 1; $e < 4; $e++) {
    output($e, "sudo make me a sandwich");
}

How do you think it evaluates?

Oh no, it's broken! Why? Because the first time colorize is evaluated, the value of ${color_id} is recorded and stored for all eternity. The term lexical isn't helpful at all in this example, because the function is *always* evaluated where it was declared; it's not passed to some other place where the value of ${color_id} could have been decided by someone other than output. 'pedia says lexical scoping is also called static scoping, which makes more sense here.

Interestingly, in the language of tweaks that is Perl, you can replace my with local on the second line and you've got yourself dynamic scoping! :-o The code will run as expected now.
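Another fix I've since stumbled on: the gotcha here is really the *named* inner sub, which Perl compiles exactly once. If you store an anonymous sub in a variable instead, a fresh closure gets built on every call to output, and as far as I can tell the lexical version then behaves just like the Python one:

sub output {
    my ($color_id, $s) = @_;

    # an anonymous sub captures the current $color_id on every call;
    # a named inner sub captures only the very first one
    my $colorize = sub {
        my ($s) = @_;
        return "\033[1;3${color_id}m${s}\033[0m";
    };

    print $colorize->("$s\n");
}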

Which is better?

I don't know. I don't have any conclusions yet. I got into the habit of writing inner functions in Python without passing in all the arguments; it's useful sometimes when you have a lot of variables in scope. And then I got in trouble for doing the same thing in Perl.

Languages without assignment will obviously pick lexical, because it reinforces referential transparency: a variable, once bound, always keeps the same value.

You need lexical scoping to have closures. A function being defined has to be able to capture and store the values of its unbound variables. Otherwise you could pass it to some other scope that doesn't have bindings for variables with those names, and then what?

But you know what? Python has closures anyway. Here, colorize is defined inside output, but it's not called there. It's passed back to the loop and called there. But that scope doesn't have a binding for color_id! And yet it still works.

def output(color_id):

    def colorize(s):
        return "\033[1;3%sm%s\033[0m" % (color_id, s)

    return colorize


for e in range(1, 4):
    f = output(e)
    print f("sudo make me a sandwich")

If you try the same thing with Perl and local in place of my, and set $color_id to $e, it works too.

So at least for Python and Perl, you can't reasonably say "dynamic scoping" or "lexical scoping". They do a bit of both. So why is that? Are the concepts dynamic and lexical just too simplistic to use in the "real world"?