Archive for June, 2008

spiderfetch, now in python

June 28th, 2008

Coding at its most fun is exploratory. It's exciting to try your hand at something new and see how it develops, choosing a route as you go along. Some people like to call this "expanding your ignorance", to convey that you cannot decide on things you don't know about, so first you have to become aware of them - and of your own ignorance. Then you can tackle them. If you want a buzzword for this I suppose you could call it "impulse driven development".

spiderfetch was driven completely by impulse. The original idea was to get rid of awkward, one-time grep/sed/awk parsing to extract urls from web pages. Then came the impulse "hey, it took so much work to get this working well, why not make it recursive at little added effort". And from there countless more impulses followed, to the point where it would be a challenge to recreate the thought process from there to here.

Eventually it settled into a 400-line ruby script that worked quite nicely, supported recipes to drive the spider, and had various other gimmicks. Because the process was completely driven by impulse, the code became increasingly dense and monolithic as more impulses were realized. It got to the point where the code worked, but was pretty much a dead end from a development point of view. Generally speaking, the deeper you get into a project, the smaller an idea has to be for it to be realized without major changes.

Introducing the web

The most disruptive new impulse was that since we're spidering anyway, it might be fun to collect these urls in a graph and be able to do little queries on them. At the very least things like "what page did I find this url on" and "how did I get here from the root url" could be useful.

spiderfetch introduces the web, a local representation of the urls the spider has seen, either visited (spidered) or matched by any of the rules. Webs are stored, quite simply, in .web files. Technically speaking, the web is a graph of url nodes, with a hash table frontend for quick lookup and duplicate detection. Every node carries information about incoming urls (locations where this url was found) and outgoing urls (links to other documents), so the path from the root to any given url can be traced.
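As a rough illustration of the idea, here is a minimal sketch of such a web: a dict as the hash table frontend, nodes carrying incoming and outgoing urls, and a breadth-first walk over incoming links to trace the path back to the root. The class and method names here are illustrative, not spiderfetch's actual API.

```python
from collections import deque

class Node:
    def __init__(self, url):
        self.url = url
        self.incoming = set()   # urls of pages this url was found on
        self.outgoing = set()   # urls found on this page

class Web:
    def __init__(self, root):
        self.root = root
        self.index = {root: Node(root)}   # hash table frontend: url -> Node

    def add_url(self, url, found_on):
        # duplicate detection is just a dict lookup
        if url not in self.index:
            self.index[url] = Node(url)
        self.index[url].incoming.add(found_on)
        self.index[found_on].outgoing.add(url)

    def trace(self, url):
        # walk incoming links back to the root (breadth-first, so the
        # path found is a shortest one)
        seen = {url}
        queue = deque([[url]])
        while queue:
            path = queue.popleft()
            if path[0] == self.root:
                return path
            for parent in self.index[path[0]].incoming:
                if parent not in seen:
                    seen.add(parent)
                    queue.append([parent] + path)

web = Web("http://example.com/")
web.add_url("http://example.com/a", found_on="http://example.com/")
web.add_url("http://example.com/b", found_on="http://example.com/a")
print(web.trace("http://example.com/b"))
# ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```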

Detecting file types

Aside from the web impulse, the single biggest flaw in spiderfetch was the lack of logic to deal with filetypes. Filetypes on the web work about as well as they do on your local computer, which means if you rename a .jpg to a .gif, suddenly it's not a .jpg anymore. File extensions are a very weak form of metadata and largely useless. It's just the same with spidering: if you find a url on a page you have no idea what it points to. If it ends in .html then it's probably html, but it can also have no extension at all. Or the extension can be misleading, which, taken to perverse lengths (eg. scripts like gallery), does away with .jpgs altogether and serves everything as .php.

In other words, file extensions tell you nothing that you can actually trust. And that's a crucial distinction: what information do I have vs what can I trust. In Linux we deal with this using magic. The file command opens the file, reads a portion of it, and scans for well known content that would identify the file as a known type.

For a spider this is a big roadblock, because if you don't know which urls are actual html files worth spidering, you pretty much have to download everything, including potentially large files like videos that are a complete waste of time (and bandwidth). So spiderfetch brings the "magic" principle to spidering: we start a download and wait until we have enough of the file to check the type. If it's the wrong type, we abort. Right now we only detect html, but there is potential for extending this with all the information the file command has (this would involve writing a parser for "magic" files, though).
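The detection step can be sketched like this: look only at the first chunk of the download and decide from the content, not the extension, whether it looks like html. The markers and chunk size below are illustrative, not the exact ones spiderfetch uses.

```python
# Substrings that mark a file as html when found near the start.
HTML_MARKERS = (b"<html", b"<!doctype html", b"<head", b"<body")

def looks_like_html(chunk):
    # case-insensitive scan of the first few kb of the download;
    # if none of the markers appear, we can abort the fetch early
    sample = chunk[:4096].lower()
    return any(marker in sample for marker in HTML_MARKERS)

print(looks_like_html(b"<!DOCTYPE html><html><head>..."))  # True
print(looks_like_html(b"\xff\xd8\xff\xe0"))                # False (jpeg magic)
```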

A brand new fetcher

To make filetype detection work, we have to be able to do more than just start a download and wait until it's done. spiderfetch has a completely new fetcher in pure python (no more calling wget). The fetcher is actually the whole reason why the switch to python happened in the first place. I was looking through the ruby documentation in terms of what I needed from the library and I soon realized it wasn't cutting it. The http stuff was just too puny. I looked up the same topic in the python docs and immediately realized that it would support what I wanted to do. In retrospect, the python urllib/httplib library has served me very well.

The fetcher has to do a lot of error handling on all the various conditions that can occur, which means it also has a much deeper awareness of the possible errors. It's very useful to know whether a fetch failed on 404 or a dns error. The python library also makes it easy to customize what happens on the various http status codes.
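For the flavor of it, here is a sketch of that kind of error awareness, written against the modern urllib.request/urllib.error names (the post predates these; it used the old urllib/httplib). The function name is my own; the point is simply that an http status like 404 and a dns failure surface as different exception types, so a fetcher can report them distinctly.

```python
import socket
import urllib.error

def classify_fetch_error(exc):
    # HTTPError must be checked first: it is a subclass of URLError
    # and carries the http status code
    if isinstance(exc, urllib.error.HTTPError):
        return "http %d" % exc.code
    # URLError wraps lower-level failures in its .reason attribute
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, socket.gaierror):
            return "dns error"
        return "network error: %s" % exc.reason
    return "unknown error"

err = urllib.error.HTTPError("http://example.com", 404, "Not Found", {}, None)
print(classify_fetch_error(err))  # http 404
```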

A modular approach

The present python code is a far cry from the abandoned ruby codebase. For starters, it's three times larger. Python may be a little more verbose than ruby, but the increase is due to new modularity and, most of all, new features. While the ruby code had eventually evolved into one big chunk of code, the python codebase is a number of modules, each of which can be extended quite easily. The spider and fetcher can both be used on their own, there is the new web module to deal with webs, and there is spiderfetch itself. dumpstream has also been rewritten from shell script to python and has become more reliable.

Grab it from github:

spiderfetch-0.4.0

emacs that firefox!

June 24th, 2008

So the other day I was thinking what a pain it is to handle text in input boxes on web pages, especially when you're writing something longer. Since I started using vim for coding I've become aware of how much more efficient it is to edit when you have keyboard shortcuts to accelerate common input operations.

I discovered a while back that bash has input modes for both vi and emacs and ever since then editing earlier commands is so much easier. And not only does it work in bash, but just as well in anything else, like ipython, irb, whatever.
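Those input modes are actually a feature of GNU readline, which is why they show up in every program that uses it. For reference, a sketch of how to switch them, assuming stock bash and readline:

```shell
# switch bash's line editing on the fly
set -o vi      # vi keybindings
set -o emacs   # emacs keybindings (readline's default)

# or set it once in ~/.inputrc, which every readline program
# (bash, ipython, irb, ...) picks up:
#   set editing-mode vi
```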

So now only Firefox remains of my most used applications that still has the problem of stone-age editing, and I'm stuck using the mouse way too much. It bugs me that I can't do Ctrl+w to kill a word. Thus I went hunting for an emacs extension and what do you know, of course there is one: Firemacs. It turns out it works well, and it also has keyboard shortcuts for navigation. > gets you to the bottom of the page, no more having to hold down <space>.

iphone = sexism

June 24th, 2008

Finger challenged women are complaining about the iphone because their uglyass [fake] long nails prevent them from using the touchscreen comfortably.

Oh man, this is too much. What's next, people who wear their hair down at shoe level will complain that it gets messed up because the street is dirty?

One clever missy has the answer, though.

I wouldn't go as far as to call it misogyny, but it sure is annoying. They should just do what I do, keep one fingernail short for hindrances such as this.

the Swedish Pirate Party

June 17th, 2008

Rick Falkvinge of the Swedish Pirate Party gives a talk at google. It's one of the best talks about free culture and "intellectual property" I've seen. I also learned that the Norwegian Liberal Party (Venstre) has adopted the same stance on free culture, bravo!

If you have reservations about the implications of copyright reform, go watch this talk, he gets all these questions from the audience.

The soundbite from Falkvinge's talk for all you 24hour news media addicts:

Copyright, while written into law that it's supposed to be for the benefit of the author, never was. It was for the benefit of the distributors.

renewip: when the router keeps disconnecting

June 15th, 2008

So we now all have broadband connections and everything is great, right? Well, not quite. Some providers have better services than others. My connection seems rather fragile at times and tends to die about once every three or four days. When that happens, no amount of resetting the equipment gets it working again. It's an upstream issue that I have no control over.

But there is another problem. Once the cable modem starts working again, the router (which receives an IP address from my provider, and serves LAN and wifi locally) doesn't seem to notice, and doesn't automatically re-establish the connection. Or rather, I'm not really sure what it does; it's a black box. There is a web interface to it with a button for reconnecting, which sometimes works, but what is really happening, who knows. There seems to be a weird timing problem to the whole thing: if I kill the power to both the modem and the router and they come back at the same time, it generally works. However, if the modem takes longer to negotiate a link, the router ends up disconnected and apparently doesn't try to reconnect on its own, so I've been stuck rebooting the two a few times until the timing is right. Resetting them separately for some reason doesn't work.

So what can be done about it? Well, the router does have that stupid web interface, so it's possible to make those clicks automatically if we're disconnected. Python's urllib makes this very easy to do. First we login with router_login, which submits a form with POST. Then we check the state of the internet connection with check_router_state, which just reads out the relevant information from the page. And if it's disconnected we run renew_router_connection to submit another form (ie. simulating the button click on the web page).

Testing connectivity

Testing whether the router has a connection to the provider is not enough, though; broadband connections sometimes have connectivity problems further upstream. Even if you have a connection, the provider may have problems on its network, meaning your connection doesn't work anyway.

So I came up with a test to see how well the connection is working. It's an optimistic test: first we assume we have a fully functional connection and ping yahoo.com. It doesn't matter which host we use here, just some internet host that is known to be reliable and "always" available. For this to work, these conditions must be met:

  1. We have to reach the gateway of the subnet where our broadband IP address lives.
  2. We have to reach the provider's nameserver (known as dns1 in the code) to look up the host "yahoo.com".
  3. We have to reach yahoo.com (we have their IP address now).

So first we ping yahoo.com. If that fails, it could be because dns lookup failed. So we ping the provider's nameserver. If that fails, the provider's internal routing is probably screwed up, so we ping the gateway. And if that fails too then we know that although we have an IP address, the connection is dead (or very unstable).

#!/usr/bin/env python
#
# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU Public License, version 3.

import os
import re
import sys
import time
import urllib

ip_factory = "192.168.2.1"
password = ""

inet_host = "yahoo.com"


def write(s):
    sys.stdout.write(s)
    sys.stdout.flush()

def grep(needle, haystack):
    if needle and haystack:
        m = re.search(needle, haystack)
        if m and m.groups(): return m.groups()[0]

def invoke(cmd):
    (sin, sout) = os.popen2(cmd)
    return sout.read()

def ping(host):
    cmd = 'ping -c1 -n -w2 ' + host + ' 2>&1'
    res = invoke(cmd)
    v = grep("rtt min/avg/max/mdev = [0-9.]+/([0-9.]+)/[0-9.]+/[0-9.]+ ms", res)
    if v: return int(float(v))

def find_lan_gateway():
    cmd = "route -n"
    res = invoke(cmd)
    v = grep(r"[0-9.]+\s+([0-9.]+)\s+[0-9.]+\s+UG", res)
    if v: return v

def load_url(url, params=None):
    data = None
    if params: data = urllib.urlencode(params)
    f = urllib.urlopen(url, data)
    return f.read()


def router_login():
    form = {"page": "login", "pws": password}
    load_url("http://%s/login.htm" % ip, form)

def check_router_state():
    state = { "conn": None, "gateway": None, "dns1": None }
    router_login()
    s = load_url("http://%s/js/js_status_main.htm" % ip)
    if s:
        v = grep("var bWanConnected=([0-9]);", s)
        if v == "1": state['conn'] = True
        elif v == "0": state['conn'] = False
        if state['conn']:
            g = grep(r'writit\("([0-9.]+)","GATEWAY"\);', s)
            if g and g != "0.0.0.0": state['gateway'] = g
            g = grep(r'writit\("([0-9.]+)","DNSIP"\);', s)
            if g and g != "0.0.0.0": state['dns1'] = g
    return state
    
def renew_router_connection():
    router_login()
    form = {"page": "status_main", "button": "dhcprenew"}
    s = load_url("http://%s/status_main.htm" % ip, form)
    return s



ip = find_lan_gateway()
if not ip:
    ip = ip_factory
    write("LAN gateway detection failed, using factory ip %s for router\n" % ip_factory)
else:
    write("Router ip: %s\n" % ip)

while True:
    try:
        router = check_router_state()
        t = time.strftime("%H:%M:%S", time.localtime())
        if router['conn']:
            hosts = [(inet_host, inet_host),
                ("dns1", router['dns1']), ("gateway", router['gateway'])]
            write("[%s] Connected  " % t)
            for (name, host) in hosts:
                delay = ping(host)
                if delay:
                    write("(%s: %s) " % (name, delay))
                    break
                else:
                    write("(%s !!) " % name)

            write("\n")
        else:
            write("[%s] NOT CONNECTED, attempting reconnect\n" % t)
            renew_router_connection()
    except Exception, e:
        cls = grep("<type 'exceptions.(.*)'", str(e.__class__))
        write("%s: %s\n" % (cls, e))
    time.sleep(3)