Archive for July, 2008

beware of shell expansion

July 24th, 2008

This is one of those details that will bite you if you don't know about it, and the resulting behavior looks enough like a bug that you might struggle to find the answer.

The characters * and ? have special meaning in the shell, as you know: * matches zero or more characters, while ? matches exactly one character. Conveniently, these glob patterns are still the same for those of us who go back to the murky days of MS-DOS.

Glob patterns are very useful in the shell, but when used with programs that themselves accept glob patterns, the results may be surprising. For example:

$ ls
file2.zip  file.zip
$ find -iname *.zip
find: paths must precede expression
Usage: find [-H] [-L] [-P] [path...] [expression]
$ find . -iname *.zip
find: paths must precede expression
Usage: find [-H] [-L] [-P] [path...] [expression]

Huh? To see what's happening here, run the shell in trace mode, which prints each command after expansion:

$ bash -x
$ find -iname *.zip
+ find -iname file2.zip file.zip
find: paths must precede expression
Usage: find [-H] [-L] [-P] [path...] [expression]

So now we see what's happening. *.zip was expanded because it matched two files in this directory. Those two files were then passed as arguments to find, not the *.zip pattern.

Bash expands the glob pattern before find ever runs, so you have to tell the shell not to expand it by quoting the pattern:

$ find -iname '*.zip'
+ find -iname '*.zip'
./file2.zip
./file.zip

What makes this extra confusing is that if *.zip doesn't match any files, the shell passes it verbatim to the program, and everything appears to work. So you should always quote your glob patterns when they are meant for a program, not the shell.
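The expansion the shell performs can be mimicked in Python with the standard glob module. A small sketch, using a scratch directory containing the two files from the example above:

```python
import glob
import os
import tempfile

# Scratch directory with the two files from the ls listing
d = tempfile.mkdtemp()
for name in ("file.zip", "file2.zip"):
    open(os.path.join(d, name), "w").close()

# This is roughly what the shell does with *.zip before find ever runs
matches = sorted(os.path.basename(m)
                 for m in glob.glob(os.path.join(d, "*.zip")))
print(matches)  # ['file.zip', 'file2.zip']

# A pattern with no matches expands to nothing here; the shell, by
# default, instead passes the literal pattern through to the program
print(glob.glob(os.path.join(d, "*.tar")))  # []
```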

New York City is just like Utrecht

July 21st, 2008

Bikes get stolen constantly and no one gives a shit. Check it.

I'm down three bikes so far, in two years. Starting with this one. I also had another bike vandalized once.

dialects

July 16th, 2008

http://www.ling.hf.ntnu.no/nos/nos_kart.html

http://swedia.ling.umu.se/snabbmeny.html

idioms for robust python

July 15th, 2008

Robustness is an interesting notion, isn't it? It's not about being prepared for what you expect to happen, but for what you don't expect. Take this somewhat vague idea of robustness a step further and you drift towards the more software engineering-y practice of defensive programming.

Before delving in, it might be good to consider what precisely you can hope to achieve with robust code. If your program crashes with a KeyError on a dictionary lookup, it's not very robust. On the other hand, if you keep getting AttributeErrors because the network is dead and your socket object is None, you have bigger problems than attribute access.

Robust code doesn't absolve you from error handling. Your program will experience error conditions, and you have to design it so that you can handle them in the right place. If your code is robust, you can achieve this goal: errors will be caught without crashing your program.

Accessing attributes

Attribute access is a minefield. I know, the object oriented philosophy makes it sound like a trifle, but it's not. When you first started coding you probably wrote code like this:

class Bottle(object):
    def capacity(self):
        return 45

# ...

print bottle.capacity()

It looks very innocuous, but what could go wrong here? We make the assumption, perhaps unwittingly, that bottle has been initialized at this point in time. Suppose bottle is an instance member that was set to None initially, and was supposed to be instantiated by the time we execute this line. Those of us who've been on a few java gulags know that this is how the recurring NullPointerException nightmare begins. In Python you get an AttributeError (sounds more benign, doesn't it?).

If you expected to receive bottle from a database or the network, you probably have good reason to suspect that it might be null. So you'll probably write a lot of code like this:

if bottle and bottle.capacity:
    print bottle.capacity()

If bottle isn't null (None, 0 or an empty string/iterable), we think everything is in order. We might also check the attribute we want to make sure that too is not null. The trouble is, that is an attribute access. So if bottle isn't null, but missing capacity, there's your AttributeError again!

It should be obvious by now, that calling any method on bottle is off the table, in case bottle is null. Instead, let's do it this way:

f = getattr(bottle, 'capacity', None)
if f:
    print f()

getattr is a very clever idea. You tell it to get the capacity attribute from the bottle object, and in case that attribute doesn't exist, just return None. The only way you'll get an exception here (a NameError) is if bottle isn't defined in the namespace. So once we have this object, either capacity or None, we check that it's not null, and then call the method.

You might think this seems like low-level nitpicking. And anyway, how do you know that capacity is a method? You could still get a TypeError here. Why not just check whether bottle is an instance of the class Bottle? If it is, then it's reasonable to expect capacity is a method too:

if isinstance(bottle, Bottle):
    print bottle.capacity()

This isn't as robust as it seems. Remember that we're trying to prepare for something that wasn't planned. Suppose that someone moved capacity to a baseclass (superclass) of Bottle. Now we are saying only Bottle instances can use the capacity method, even though other objects also have it.

It's more Pythonic to cast a wider net and not be so restrictive. We could use getattr to get the object that we expect is a method. And then we can check if it's a function:

import types

unkn = getattr(bottle, 'capacity', None)
if isinstance(unkn, types.FunctionType):
    print unkn()

This doesn't work, because a method is not of type function. You can call it, but it's not a function (queer, isn't it?). An instance of a class that implements __call__ is also callable, but it isn't a function either. So we should simply ask whether the object is callable at all, which is what the callable built-in checks:

unkn = getattr(bottle, 'capacity', None)
if callable(unkn):
    print unkn()
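
To see why the FunctionType check fails where callable succeeds, compare the two directly. A quick sketch using the Bottle class from above:

```python
import types

class Bottle(object):
    def capacity(self):
        return 45

bottle = Bottle()

# A bound method is not a function...
print(isinstance(bottle.capacity, types.FunctionType))  # False

# ...but it is callable, as is any object with a __call__ method
print(callable(bottle.capacity))  # True

class Opener(object):
    def __call__(self):
        return "pop"

print(callable(Opener()))  # True, though it's not a function either
```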

Now obviously, writing every method call in your program like this would be madness. As a coder, you have to consider the degree of uncertainty about your objects.

Another way to go about this is to embrace exceptions: write the most naive code and just wrap a try/except around it. I don't enjoy that style as much, because try/except alters the control flow of your program. This merits a longer discussion, but basically you have to increment the level of indentation, variable scope becomes a concern, and exceptions easily add up.
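
For comparison, the naive-code-plus-try/except version of the same call might look like this (a sketch; what you do in the except branch depends on your program):

```python
class Bottle(object):
    def capacity(self):
        return 45

bottle = None  # e.g. never got initialized

try:
    message = bottle.capacity()
except AttributeError:
    # bottle was None (or had no capacity attribute); handle it here
    message = "no bottle to measure"

print(message)  # no bottle to measure
```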

Setting attributes

If you only want to set a predetermined attribute, then nothing is easier. (Obviously this won't work on objects without a per-instance __dict__, such as instances of built-in types like dict, or of classes that define __slots__.) You can set attributes both on instances and on classes:

bottle.volume = 4
Bottle.volume = 4

But if the attribute name is going to be determined by some piece of data (like the name of a field in a database table, say), you need another approach. You could just set the attribute in the object's __dict__:

bottle.__dict__['volume'] = 4
Bottle.__dict__['volume'] = 4    ## fails

But this is considered poor style, __dict__ isn't supposed to be accessed explicitly by other objects. Furthermore, the __dict__ of a class is exposed as a dictproxy object, so you can't do this to set a class attribute. But you can use setattr:

setattr(bottle, 'volume', 4)
setattr(Bottle, 'volume', 4)
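
With setattr the attribute name can come from data. A sketch, where the field name is only known at runtime (the names here are just an illustration):

```python
class Bottle(object):
    pass

bottle = Bottle()

field = "volume"  # imagine this came from a database column name
setattr(bottle, field, 4)
print(bottle.volume)  # 4

setattr(Bottle, field + "_unit", "liters")  # works on the class too
print(bottle.volume_unit)  # liters
```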

Dictionary lookup

Dictionaries, the bedrock of Python. We use them all the time, not always wisely. The naive approach is to assume the key exists:

bases = {"first": "Who", "second": "What"}

print bases["third"]    ## raises KeyError

To guard against that, dicts have a has_key method just for this purpose:

if bases.has_key("third"):
    print bases["third"]

But it's more Pythonic to keep it simple as can be:

if "third" in bases:
    print bases["third"]

dicts also have a failsafe method equivalent to getattr, called get. You can also give it a default value (as the second parameter, not shown here) to return if the key doesn't exist:

third = bases.get("third")
if third:
    print third

I would argue that get is preferable, because you don't have to look up the key twice. (And you don't risk defeat snatched from the jaws of victory if a context switch occurs between the membership test and the lookup, and another thread removes the key after you've checked for it.)
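
The default-value form of get mentioned above looks like this. One caveat worth knowing: a plain truthiness check on the result can't tell a missing key from a value that happens to be falsy (the shortstop entry here is just an illustration):

```python
bases = {"first": "Who", "second": "What", "shortstop": ""}

# The second parameter is the default returned when the key is missing
print(bases.get("third", "I Don't Know"))  # I Don't Know

# A falsy value and a missing key look the same to a truthiness check
print(bases.get("shortstop") or "missing?")  # missing?
```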

resume downloads safely

July 8th, 2008

You can resume an http download by setting the Range header, or an ftp download with the REST command. Not all hosts support this, but many do.

If you use wget a lot, have you ever asked yourself why --continue isn't on by default? Surely it's better to resume an interrupted download than to restart it? If you look up the man page, it has this healthy reminder for you:

-c, --continue
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program. For instance:

wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z

If there is a file named ls-lR.Z in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file.

Note that you don’t need to specify this option if you just want the current invocation of Wget to retry downloading a file should the connection be lost midway through. This is the default behavior. -c only affects resumption of downloads started prior to this invocation of Wget, and whose local files are still sitting around.

Without -c, the previous example would just download the remote file to ls-lR.Z.1, leaving the truncated ls-lR.Z file alone.

Beginning with Wget 1.7, if you use -c on a non-empty file, and it turns out that the server does not support continued downloading, Wget will refuse to start the download from scratch, which would effectively ruin existing contents. If you really want the download to start from scratch, remove the file.

Also beginning with Wget 1.7, if you use -c on a file which is of equal size as the one on the server, Wget will refuse to download the file and print an explanatory message. The same happens when the file is smaller on the server than locally (presumably because it was changed on the server since your last download attempt)---because ‘‘continuing’’ is not meaningful, no download occurs.

On the other side of the coin, while using -c, any file that’s bigger on the server than locally will be considered an incomplete download and only "(length(remote) - length(local))" bytes will be downloaded and tacked onto the end of the local file. This behavior can be desirable in certain cases---for instance, you can use wget -c to download just the new portion that’s been appended to a data collection or log file.

However, if the file is bigger on the server because it’s been changed, as opposed to just appended to, you’ll end up with a garbled file. Wget has no way of verifying that the local file is really a valid prefix of the remote file. You need to be especially careful of this when using -c in conjunction with -r, since every file will be considered as an "incomplete download" candidate.

Another instance where you’ll get a garbled file if you try to use -c is if you have a lame HTTP proxy that inserts a ‘‘transfer interrupted’’ string into the local file. In the future a ‘‘rollback’’ option may be added to deal with this case.

I pasted the whole thing here, because it nicely summarizes the many reasons why "resume by default" is not safe.

As a matter of fact, that's not all. wget doesn't even know whether the local file with the same name actually is the same file that's on the server. And even if it is, the first attempt at downloading didn't succeed, and the interruption, however unlikely, could have corrupted the local file. So even if you download the rest of it, you may not be able to use it anyway.

What can we do about this? To be sure that a) it's the same file and b) it's uncorrupted, we have to download the whole thing. That is, for obvious reasons, not desirable. Instead, I propose re-downloading the last portion of the file as a checksum. The fetcher in spiderfetch uses the last 10kb of the local file to determine if the resume should proceed. If the last 10kb of the local file doesn't agree with the same 10kb of the remote file, the fetcher exits with a "checksum" error.

The main benefit of this method is to verify that it's the same file. Clearly, it can still fail, but I imagine that with most file formats 10kb is enough to detect a divergence between two files.
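
The tail-comparison idea can be sketched like this. This is a minimal sketch, not the actual spiderfetch code; fetch_remote_range is a hypothetical callback standing in for an HTTP Range request (or an FTP REST transfer):

```python
TAIL = 10 * 1024  # compare the last 10kb of the local file

def safe_to_resume(local_bytes, fetch_remote_range):
    """Return True if the local file's tail matches the same span of
    the remote file.

    local_bytes: contents of the partial local file.
    fetch_remote_range(start, end): fetches remote bytes [start:end],
    e.g. via an HTTP Range header.
    """
    start = max(0, len(local_bytes) - TAIL)
    return local_bytes[start:] == fetch_remote_range(start, len(local_bytes))

# Simulate with an in-memory "remote" file
remote = b"x" * 50000
local_ok = remote[:30000]   # clean partial download of the same file
local_bad = b"y" * 30000    # same length, but a different file

fetch = lambda s, e: remote[s:e]
print(safe_to_resume(local_ok, fetch))   # True
print(safe_to_resume(local_bad, fetch))  # False
```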