Archive for April, 2008

war is a racket

April 30th, 2008

For all the patriotic baloney nations are fed in pre-war time, with grandiose appeals to moral rightousness and complete confidence in their own success, it is little more than powerful, rich men sending clueless (or powerless) poor men to their death.

War is a racket. It always has been. It is possibly the oldest, easily the most profitable, surely the most vicious. It is international in scope. It is the only one in which the profits are reckoned in dollars and the losses in lives.

A racket is best described, I believe, as something that is not what it seems to the majority of the people. Only a small "inside" group knows what it is about. It is conducted for the benefit of the very few, at the expense of the very many. Out of war a few people make huge fortunes.

Who wrote this? Why, only the highly decorated general Smedley D. Butler, in 1935.

Yeap, that's right, folks. The plot in Inside Man wasn't made up. It was a real plot about a fictional person, crafted on the histories of real people.

Here's another truth ringer:

Like all the members of the military profession, I never had a thought of my own until I left the service.

But of course. Who in their right mind would go kill people at the risk of getting killed just so that a few rich men can get richer?

spiderfetch, part 2

April 28th, 2008

Note: If you haven't read part 1 you may be a little lost here.

So, the inevitable happened (as it always does, duh). I start out with a simple problem and not too many ambitions about what I want to accomplish. But once I reach that plateau, nay well before reaching it, I begin to ever so quietly ask myself the question "wait a second, what if x?" and "this looks specialized, I wonder if I could generalize...". And so before ever even reaching that hill top I've already, covertly, committed myself to taking it one step further. Not through a conscious decision, but through those lingering peripheral thoughts that you know won't disappear once they've struck. A bell cannot be unrung and all that.

I realized this was happening, but I didn't want to get into too much grubby stuff in the first blog entry, so I decided to keep that one simple and continue the story here. The first incarnation of spiderfetch had a couple of flaws that bugged me.

  1. No way to inspect how urls were being matched on the page, or even reason to believe this was happening correctly, other than giving an input and checking that all the expected urls were found. To make matching evident, I would need to be able to see the matches visually on the page.
    This has been addressed with a new option --dumpcolor, which dumps the index page and highlights the matches. This has made it much easier to verify that matching is done correctly.
  2. Matching wasn't sufficiently effective. The regex I had written would match urls inside tags, as long as they were in quotes. But this would still miss unquoted urls, and it also excluded all other urls on the page, which may or may not be of interest. I also realized that a single regex, no matter how refined, would be unlikely to match simultaneously all the urls that may be of interest.
    The obvious response is to add an option for multiple regexes, which is exactly what happened. This obviously adds another layer of complexity to debugging regexes, so the match highlighting was extended to colorize every match in a different color. Furthermore, where two regexes would match the same characters, the highlighting is in bold to indicate this.

With that, I was far happier with the ability to infer and verify correctness in the matching behavior. Surely now everything is honkey dorey?

Or not? (As a classmate of mine likes to say after delivering a convincing argument, but graciously gives you the chance to state your objections anyway). Well, if you read part 1 of this adventure right to the end, noting the observation that spiderfetch could be run recursively, you may have thought what I thought. Well gosh, Bubba, this is starting to sound like wget --mirror. Since I've set up all this infrastructure already -- to spider a single page -- it wouldn't really take much to generalize it to run recursively.

There are a couple of problems to solve, however. Firstly, the operational model for spiderfetch was very simple: spider a page, then fetch all the urls that match the pattern. In terms of multiplicity: 1 page to spider, 1 pattern to match urls against, n urls to find. If we now take this a step further, in the next pass we have n urls to spider (obtained form the n urls found in the first step), and we may need 1 pattern to filter some of them. Next, we spider these pages, which produces (m1+m2+...) (or roughly, n*m) urls and so on. This becomes rather convoluted to explain in words, so let's visualize.

Starting at the url to be spidered (the top green node), we spider the page for urls. For each of the urls found, it ends up in one of three categories:

  1. It matches the spider filter, so it becomes a url to spider in the next round (a black arrow).
  2. It matches the fetch filter, so it becomes a url to fetch (a blue arrow).
  3. It matches neither and is discarded (not shown).

In the next round, we gather up all the urls that are to be spidered (the black arrows starting at the top green node) and do the same thing for each one as we did with just the one page to begin with.

But this complicates matters quite a lot. We now have to deal with a bunch of new issues:

  1. How do we traverse the nodes? wget in mirror/spider modes goes depth-first, which I always thought was eccentric. I don't know why they do it this way, but I'm guessing to minimize memory use. If you go breadth-first then at every step you have to keep track of all the green nodes at the current level, which grows exponentially. Meanwhile, depth-first give you linear growth, so that choice is well justified. But, on the other hand, the traversal order seems a bit unintuitive, because you "jump" from the deepest corner of your filesystem back to the top level and so on. I wonder if this turns out to be foolish (I don't expect spiderfetch to get the same kind of workout that wget does, obviously), but I've chosen the opposite approach, which I think also makes it easier to track what is happening underway.
  2. How deep do we want to go? Do we want to set an upper bound or (gasp) let it run until it stops?
  3. Until now we've only needed one filter (for the blue arrows at the top green node). Now we suddenly have a lot more arrows that we should be able to filter in some meaningful way. Obviously, we don't want a pair of filters for every single node. Not only would that be madness, but we don't know in advance how many nodes there will be.
    Our old friend wget only has one filter you can set for the whole site. But we want to be more specific than that, so there is a pair of filters (spider, fetch) for every level of the tree. This gives pretty decent granularity.

So how can we represent this cleanly? Well, it would be rather messy to have to input this as a command line parameter, besides which a once written scheme for a particular scenario could be reusable. So instead we introduce the idea of a recipe composed of rules. Starting from the top of the tree, each rule applies to the next level of the tree. And once we have no more rules -- or no more urls to spider -- we stop.

Let's take the asx example from part 1, where we had a custom made bash script to do the job. We can now rewrite it like this. First, the recipe is a list of rules, each rule is a hash. So starting from the top green node, we grab the first rule in the list, the one that contains the symbol :spider. This gives us the pattern to match urls on that page for spidering. There are no other patterns in there, so we spider these urls and then move on to the next step. We are now at the level below the top green node in the tree, with a bunch of pages from urls ending in .asx. We now grab the next rule in the recipe. This one gives a pattern for :dump, which means "dump these urls to the screen". So we find all the urls that match this pattern in all of our green nodes and dump them. Since there are no more rules left, this is where we stop.

module Recipe 
	RECIPE = [
		{ :spider => "\.asx$" },
		{ :dump => "^mms:\/\/" },
	]
end

So you would use it like this:

spiderfetch.rb --recipe asx http://www.something.com/somewhere

The options for patterns are :spider, :fetch, and :dump. If you want to repeat the same rule several times (for example to spider an image gallery with 10 pages, which are linked together with Next and Previous links), you can also set :depth to a positive integer value. This will descend in the tree the given number of times, using the same rule again and again.

And if you're feeling completely mental, you can even set :depth => -1, which will repeat the same rule until it runs out of urls to spider. You should probably combine this with --host, which will make sure you only spider the host (domain, to be exact) you started with, rather than the whole internet. (It will still allow :fetch and :dump to match urls on other hosts, so if you're spidering for images and they live on http://img.host.com rather than http://www.host.com, they will still be found.)

Lastly, as a heavy handed arbitration measure, if you execute a recipe and pass either of --dump or --fetch this will switch all your :fetch patterns to :dump or vice versa. Might be nice to be able to check that the right urls are being found before you start fetching, for instance.

Download and go nuts:

spiderfetch-0.3.1.tar.gz

UPDATE: Paul Hawkins wrote to say that wget actually runs in breadth-first mode.

download all media links on a webpage

April 26th, 2008

This has probably happened to you. You come to a web page that has links to a bunch of pictures, or videos, or documents that you want to download. Not one or two, but all. How do you go about it? Personally, I use wget for anything that will take a while to download. It's wonderful, accepts http, https, ftp etc, has options to resume and retry, it never fails. I could just use Firefox, and if it's small files then I do just that, and click all the links in one fell swoop, then let them all download on their own. But if it's larger files then it's not practical. You don't want to download 20 videos of 200mb each in parallel, that's no good. If Firefox crashes within the next few hours (which it probably will) then you'll likely end up with not even one file successfully downloaded. And Firefox doesn't have a resume function (there is a button but it doesn't do anything ).

So there is a fallback option: copy all the links from Firefox and queue them up for wget: right click in document, Copy Link Location, right click in terminal window. This is painful and I last about 4-5 links before I get sick of it, download the web page and start parsing it instead. That always works, but I have to rig up a new chain of grep, sed, tr and xargs wget (or a for loop) for every page, I can never reuse that and so the effort doesn't go a long way.

There is another option. I could use a Firefox extension for this, there are some of them for this purpose. But that too is fraught with pain. Some of them don't work, some only work for some types of files, some still require some amount of manual effort to pick the right urls and so on, some of them don't support resuming a download after Firefox crashes. Not to mention that every new extension slows down Firefox and adds another upgrade cycle you have to worry about. Want to run Firefox 3? Oh sorry, your download extension isn't compatible. wget, in contrast, never stops working. Most limiting of all, these extensions aren't Unix-y. They assume they know what you want, and they take you from start to end. There's no way you can plug in grep somewhere in the chain to filter out things you don't want, for example.

So the problem is eventually reduced to: how can I still use wget? Well, browsers being as lenient as they are, it's difficult to guarantee that you can parse every page, but you can at least try. spiderfetch, whose name describes its function: spider a page for links and then fetch them, attacks the common scenario. You find a page that links to a bunch of media files. So you feed the url to spiderfetch. It will download the page and find all the links (as best it can). It will then download the files one by one. Internally, it uses wget, so you still get the desired functionality and the familiar output.

If the urls on the page require additional post-processing, say they are .asx files you have to download one by one, grab the mms:// url inside, and mplayer -dumpstream, you at least get the first half of the chain. (Unlikely scenario? If you wanted to download these freely available lectures on compilers from the University of Washington, you have little choice. You could even chain spiderfetch to do both: first spider the index page, download all the .asx files, then spider each .asx file for the mms:// url, print it to the screen and let mplayer take it from there. No more grep or sed. )

Features

  • Spiders the page for anything that looks like a url.
  • Ability to filter urls for a regular expression (keep in mind this is still Ruby's regex, so .* to match any character, not * as in file globbing, (true|false) for choice and so on.)
  • Downloads all the urls serially, or just outputs to screen (with --dump) if you want to filter/sort/etc.
  • Can use an existing index file (with --useindex), but then if there are relative links among the urls, they will need post-processing, because the path of the index page on the server is not known after it has been stored locally.
  • Uses wget internally and relays its output as well. Supports http, https and ftp urls.
  • Semantics consistent with for url in urls; do wget $url... does not re-download completed files, resumes downloads, retries interrupted transfers.

Limitations

  • Not guaranteed to find every last url, although the matching is pretty lenient. If you can't match a certain url you're still stuck with grep and sed.
  • If you have to authenticate yourself somehow in the browser to be able to download your media files, spiderfetch won't be able to download them (as with wget in general). However, all is not lost. If the urls are ftp or the web server uses simple authentication, you can still post-process them to: ftp://username:password@the.rest.of.the.url, same for http.

Download spiderfetch:

Recipes

To make the use a bit clearer, let's see some concrete examples.

Recipe: Download the 2008 lectures from Fosdem:

spiderfetch.rb http://www.fosdem.org/2008/media/video 2008.*ogg

Here we use the pattern 2008.*ogg. If you first run spiderfetch with --dump, you'll see that all the urls for the lectures in 2008 contain the string 2008. Further, all the video files have the extension ogg. And whatever characters come in between those two things, we don't care.

Recipe: Download .asx => mms videos

Like it or not, sometimes you have to deal with ugly proprietary protocols. Video files exposed as .asx files are typically pointers to urls of the mms:// protocol. Microsoft calls them metafiles. This snippet illustrates how you can download them. First you spider for all the .asx urls, using the pattern \.asx$, which means "match on strings containing .asx as the last characters of the string". Then we spider each of those urls for actual urls to video files, which begin with mms. And for each one we use mplayer -dumpstream to actually download the video.

#!/bin/bash

mypath=$(cd $(dirname $0); pwd)
webpage="$1"

for url in $($mypath/spiderfetch.rb $webpage "\.asx$" --dump); do 
	video=$($mypath/spiderfetch.rb $url "^mms" --dump)
	mplayer -dumpstream $video -dumpfile $(basename $video)
done

clocking jruby1.1

April 21st, 2008

Did you hear the exciting news? JRuby 1.1 is out! For real, you can call your grandma with the great news. Wow, that was quick.

Okay, so the big new thing in JRuby is a bytecode compiler. As you may know, up to 1.0 it was just a Ruby interpreter in Java. Now you can actually compile Ruby modules to Java classes and no one will know the difference, very devious. Sounds like Robin Hood in a way, doesn't it?

The JRuby guys are claiming that this makes JRuby on par with "regular Ruby" on performance, if not better. Hmm. Just to be on the safe side, what size shoes do you wear? Oh ouch, those are going to be tricky to fit in your mouth. And Freud will say you're stuck in the oral stage. Too much? Okay.

So here is my completely unvetted, dirty, real world test. No laboratory conditions here, you're in the ghetto. First we need something *to* test. I don't have a great deal of Ruby code at my disposal, but this should do the trick. How does scanning the raw filesystem for urls sound? The old harvest script actually does a half decent job of turning up a bunch of findings.

Now introducing the contenders. First up, his name is JRuby, you know him from occasional mentions on obscure blogs and the programming reddit past the top 500 entries. He promises to free all Java slaves by giving away free Rubies to everyone!

Aaand the incumbent, the famous... Ruby! You know him, your parents know him, every family would adopt him as their own child if they could. He's the destroyer of kingdoms and the creator of empires, he's bigger than Moses himself!

Our two drivers will be racing across a hostile territory. Your track is a 25gb ext3 live file system. During this time, I can promise you that only Firefox is likely to be writing new urls to disk, but I could be lying eheheh. Due to the unpredictable nature of this rally track, regulations allow only one racer at a time, but you will be clocked.

First up is the new kid on the block Jay....Ruby. The Ruby code will not be compiled before execution, we'll let the just-in-time compiler do its thing.

$ time ( sudo cat /dev/sda5 | bin/jruby harvest.rb --url > /tmp/fsurls.jruby )
real 39m26.547s
user 37m19.072s
sys 1m28.406s

Not too shabby for a first run, but since this a brand new venue, we have no frame of reference yet. Let's see how Ruby will do here.

$ time ( sudo cat /dev/sda5 | harvest.rb --url > /tmp/fsurls.ruby )
real 78m42.186s
user 62m12.537s
sys 2m18.721s

Well, look at that! The new kid is pretty slick, isn't he? Sure is giving the old man a run for his money. Let's see how they answered the questions.

$ lh
-rw-r--r-- 1 alex alex 86M 2008-04-21 18:29 fsurls.jruby
-rw-r--r-- 1 alex alex 8.6G 2008-04-21 20:58 fsurls.ruby

Yowza! No less than a hundred times more matches with Ruby. What is going on here? Did Jay just race to the finish line, dropping the vast majority of his parcels? Or did father Ruby see double and triple and quadruple, ending up with lots and lots of duplicates? Well, we don't really *know* how many urls exist in those 25gb of data, but it seems a little bit suspect that there would be in excess of 8gb of them.

One way or the other, it's pretty clear that the regular expression semantics are not entirely identical. In fact, you might be sweating a little right now if your code uses them heavily.

UPDATE: Squashing duplicates in both files actually produces two files of very similar size (13mb), in which the disparity of unique entries is only a very reasonable 4% (considering the file system was being written to in the process). The question still remains how did Ruby produce 8gb of output.

what the heck is a closure?

April 20th, 2008

That's a question that's been bugging me for months now. It's so vexing to try to find something out and not getting it. All the more so when you look it up in a couple of different places and the answers don't seem to have much to do with each other. Obviously, once you have the big picture, all those answers intersect in a meaningful place, but while you're still hunting for it, that's not helpful at all.

I put this question to a wizard and the answer was (not an exact quote):

A function whose free variables have been bound.

Don't you love to get a definition in terms of other terms you're not particularly comfortable with? Just like a math textbook. This answer confused me, because I couldn't think of a case that I had seen where that wasn't the case, so I thought I must be missing something. The Python answer is very simple:

A nested function.

It's sad, but one good answer is enough. When you can't get that, sometimes you end up stacking up several unclear answers and hoping you can piece it all together. And that can very well fail.

I read a definition today that finally made it clear to me. It's not the simplest and far from the most intuitive description. In fact, it too reads like a math textbook. But it's simply what I needed to hear in words that would speak to me.

A lexical closure, often referred to just as a closure, is a function that can refer to and alter the values of bindings established by binding forms that textually include the function definition.

I read it about 3 times, forwards and backwards, carefully making sure that as I was lining up all the pieces in my mind, they were all in agreement with each other. And once I verified that, and double checked it, I felt so relieved. Finally!

I can't follow the Common Lisp example that follows on that page, but scroll down and you find a piece of code that is much simpler.

(define (foo x)
	(define (bar y)
		(+ x y))
	bar)

(foo 1) 5 => 6
(foo 2) 5 => 7

What's going on here? First there is a function being defined. Its name is foo and it takes a parameter x. Now, once we enter the body of this function foo, straight away we have another function definition - a nested function. This inner function is called bar and takes a parameter y. Then comes the body of the function bar, which says "add variables x and y". And then? Follow the indentation (or the parentheses). We have now exited the function definition of bar and we're back in the body of foo, which says "the value bar", so that's the return value of foo: the function bar.

In this example, bar is the closure. Just for a second, look back at how bar is defined in isolation, don't look at the other code. It adds two variables: y, which is the formal parameter to bar, and x. How does x receive its value? It doesn't. Not inside of bar! But if you look at foo in its entirety, you see that x is the formal parameter to foo. Aha! So the value of x, which is set inside of foo, carries through to the inner function bar.

Can we square this code with the answers quoted earlier? Let's try.

A function whose free variables have been bound. - A function, in this case bar. Free variables, in this case x. Bound, in this case defined as the formal parameter x to the function foo.

A nested function. - The function bar.

A lexical closure, often referred to just as a closure, is a function that can refer to and alter the values of bindings established by binding forms that textually include the function definition. - A function, in this case bar. That can refer to and alter, in this case bar refers to the variable x. values of bindings, in this case the value of the bound variable x. established by binding forms, in this case the body of the function foo. that textually include the function definition, in this case foo includes the function definition of bar.

So yes, they all make sense. If you understand what it's all about.

Let's return to the code example. We now call the function foo with argument 1. As we enter foo, x is bound to 1. We now define the function bar and return it, because that is the return value of foo. So now we have the function bar, which takes one argument. We give it the argument 5. As we enter bar, y is bound to 5. And x? Is it an undefined argument, since it's not defined inside bar? No, it's bound *from before*, from when foo was called. So now we add x and y.

In the second call, we call foo with a different argument, thus x inside of bar receives a different value, and once the call to bar is made, this is reflected in the return value.

Well, that was easy. And to think I had to wait so long to clarify such a simple idiom. So what is all the noise about anyway? Think of it as a way to split up the assignment of variables. Suppose you don't want to assign x and y at the same time, because y is a "more dynamic" variable whose value will be determined later. Meanwhile, x is a variable you can assign early, because you know it's not going to need to be changed.

So each time you call foo, you get a version of bar that has a value of x already set. In fact, from this point on, for as long as you use this version of bar, you can think of x as a constant that has the value that it was assigned when foo was called. You can now give this version of bar to someone and they can use it by passing in any value for y that they want. But x is already determined and can't be changed.