Archive for the ‘technology’ Category

python patterns for graph traversal

Monday, March 8th, 2010

python_graphtraversal_picGraphs, they are fun. You can use them for all sorts of things, like drawing a picture of the internet (or your favorite regular expression!)

For example, look at that handsome guy there on the right. Is that a gorgeous graph, or what? (Incidentally, if you aren’t yet a fan of graphviz, it’s time to become one, it’s superb!)

Now take a gander at the code snippet below, that’s the python representation of it.

class Node(object):
    def __init__(self, name):
        self.name = name
 
a, b, c, d = Node('a'), Node('b'), Node('c'), Node('d')
a.refs = [b, d]
b.refs = [c]
c.refs = [a]
d.refs = []

Download this code: python_graphtraversal_model.py

But it’s not there just to be stared at, so let’s do something with it! One thing we could do is go over the graph and print out the nodes. But let’s do one better, let’s also show how deep into the graph we are by indenting the output! What we want is this:

a
__b
____c
__d

Download this code: python_graphtraversal_spec.py

We don’t have the nice arrows here, but you can still make out the shape of the graph.

First iteration

def traverse_naive(node, depth):
    print('__'*depth + node.name)
    for ref in node.refs:
        traverse_naive(ref, depth+1)

Download this code: python_graphtraversal_naive.py

Very simple. Print the name of the node at the current level of depth, and then recurse down the outgoing edges. But when we run this something bad happens:

>>> traverse_naive(a, 0)
a
__b
____c
______a
________b
..
RuntimeError: maximum recursion depth exceeded

Download this code: python_graphtraversal_naive_output.py

There’s no sign of d, instead we keep going round and round along the path a-b-c-a. Woops.

Preempting cycles

Traditionally, graph traversal algorithms have set properties on nodes in the graph to indicate to themselves that a particular node had been seen before. For example, we could set node.been_here_before = True every time we enter a node, and then we could check for this property to make sure we don’t re-enter the node later.

But this is not awesome, because we then have to change the graph as we traverse it. What if we want to traverse it again later, do we then need another traversal algorithm to remove all the markers or what?

There is another way to do this, however. We can use a data structure completely outside the graph in which we keep track of what we’ve seen so far (imagine a guy walking around a warehouse with a clipboard, he then doesn’t have to mark any of the merchandise!).

So instead of checking the value of node.been_here_before, we’re going to check the value of cache[node].

Now, there are two main strategies for where to do this check. We can do it right before the recursive call, or right after.

def traverse_check_before(node, depth, cache):
    cache[node] = None
 
    print('__'*depth + node.name)
    for ref in node.refs:
        # we are about to recurse, first check if the node we want
        # to recurse on is in the cache
        if ref in cache:
            print('Already in cache: %s' % ref.name)
            continue
        # recurse if we reach this point
        traverse_cached(ref, depth+1, cache)
 
 
def traverse_check_after(node, depth, cache):
    # we have just recursed, return if this node is already in
    # the cache
    if node in cache:
        print('Already in cache: %s' % node.name)
        return
    cache[node] = None
 
    print('__'*depth + node.name)
    for ref in node.refs:
        # recurse without exception
        traverse_cached(ref, depth+1, cache)

Download this code: python_graphtraversal_two.py

Here it doesn’t matter which one we use, so we’re just going to use the second option.

Second iteration

At this point you might be thinking that everything is going swimmingly, but in fact a small problem has crept up. It cannot have escaped your attention that we’ve snuck another parameter into that function definition. This means that we have to call the function like this:

>>> traverse_check_after(a, 0, {})
a
__b
____c
Already in cache: a
__d

Download this code: python_graphtraversal_param.py

The function works, but we really, really don’t want to have to do it like this. There is no reason the caller has to know about the cache, let alone that it is a dictionary. And if we ever decide to change the cache mechanism, we have to update all the client code.

So what can we do? We can’t set up the cache in the function body, because the function is recursive!

This is where you might get that look in your eye that you get when you’re being clever. Hah, but what about this:

def traverse_cached(node, depth, cache={}):
    if node in cache:
        print('Already in cache: %s' % node.name)
        return
    cache[node] = None
 
    print('__'*depth + node.name)
    for ref in node.refs:
        traverse_cached(ref, depth+1, cache=cache)

Download this code: python_graphtraversal_cached.py

Brilliant! We don’t have to set up the cache neither inside the function body nor at the call site. This cache has to be initialized somewhere between the call and the function being executed, but somehow we’ve found a magical in-between, haven’t we!

Keyword parameters really are a tremendous boon, but unfortunately they will not save our skin this time. Do you know what happens if we do this?

>>> traverse_cached(a, 0)
>>> traverse_cached(a, 0)

Download this code: python_graphtraversal_cached_call.py

Read and weep:

>>> traverse_cached(a, 0)
a
__b
____c
Already in cache: a
__d
>>> traverse_cached(a, 0)
Already in cache: a

Download this code: python_graphtraversal_cached_output.py

Keyword parameters

Keyword parameters don’t do what you thought they did. You thought traverse_cached would get a new dictionary every time it was called without a cache argument. But that’s not what happens. (But isn’t python magical???)

In fact, it works like this. The cache keyword parameter gets initialized once and for all when the function is compiled. Since the value it is set to is an empty dictionary, the dict object is initially empty. But every time you call the function, it’s the same dictionary that gets passed in! :eek:

Third iteration

It’s come to this, I’m afraid. The problem is unsurmountable. We can’t not pass the cache parameter. But we definitely don’t want to do it at the call site. Which leaves no other option than…

def traverse_cached_fixed(node, depth):
    def rec(node, depth, cache):
        if node in cache:
            print('Already in cache: %s' % node.name)
            return
        cache[node] = None
 
        print('__'*depth + node.name)
        for ref in node.refs:
            rec(ref, depth+1, cache)
    rec(node, depth, {})

Download this code: python_graphtraversal_fixed.py

I really hate using an inner function just for this. It makes it messier, you have to route the arguments through an extra function call. But that’s the price you pay, I’m afraid.

Postscript

By now you must be fuming at the fact that I’ve come all this way while pretending that the function call didn’t have another parameter that was completely pointless and indeed deserving of the very same criticism.

Well, the difference is that you can set depth=0 as a keyword parameter in the function definition, so that you don’t have to pass it. And this doesn’t break anything. But why??

The answer is found when poking around python. A function is in fact a mutable object. Have a look:

def f(a=0):
    a = 1
 
>>> print f.func_defaults
(0,)

Download this code: python_graphtraversal_keywords1.py

func_defaults is the tuple that stores the values of the keyword parameters which have a default value. The value we see here is the integer 0, and it will not change no matter how many times we execute the function.

But collection types (and your own types too) are different:

def g(a=[]):
    a.append(1)
    print a
 
>>> print g.func_defaults
([],)
>>> g()
[1]
>>> print g.func_defaults
([1],)
>>> g()
[1, 1]

Download this code: python_graphtraversal_keywords2.py

Despite different, it actually works the same way (sorta). The binding of the keyword default is constant. But with an integer, the parameter is bound to the constant 0, whereas with “heavier” objects, it is bound to the object, but the object’s inner properties can very well mutate!!!

fake a dual monitor display!

Saturday, January 30th, 2010

Wouldn’t we all love to have a beefy workstation with at least two giant lcd monitors? Alas, I have a slim laptop with a small screen. And another laptop, almost 10 years old, albeit with a nice and large screen. I naturally prefer to use the newer machine for performance, but it also means making do with a small monitor.

I can tell you it’s a real pain to author latex documents this way, I can’t fit both the kile and evince on the screen at the same time. It wasn’t until recently that it hit me what I was doing wrong. There are three processes involved here:

  1. Document editor.
  2. Compiler (I run a loop that invokes make continuously in the background).
  3. Document viewer.

Come to think of it, this applies just as well to coding if you think “running the code” on the last step.

Well, X11 is a display server, for peet’s sake! So you have the editor on the workstation, but then you log in from the other laptop (with the larger screen) and run evince to display there.

Just do:

oldlaptop$ ssh -XYC workstation

Don’t ask me why -Y, I don’t know, but that’s how I get my ubuntu to allow remote connections.

a firewall in layman’s terms

Thursday, January 21st, 2010

Dealing with companies can be frustrating, because they like to appear opaque to the outside world. When you look on the website you find a page with contact information. You’ll find a phone number and an email address, and maybe more than one if it’s a large company with several departments. But they all point to the reception. Few companies are generous enough to give you direct access to their personnel with a listing of employees and their contact information.

So if there is a person you have to get to you have to do it through the reception. “Yes, I am blah and I need bleh and why don’t you just transfer me to the person I need to talk to, per favore!” It’s not a lot of fun, but this way whoever makes this decision to give out only the number of the reception can also decide who may and may not receive calls. And even on what conditions. If you say the magic word then, yes, you can get Frank on the line, otherwise not. And maybe Steve has been known to say too much and has a history of divulging information he wasn’t supposed to. So no, you can’t talk to Steve.

Well, this is the principle of a firewall. The reception screens calls with the discretion to reject or divert according to the protocol that has been instituted. Some people can be reached anytime, others only at certain intervals. Some are available depending on your request, and some are completely unreachable.

This picture, however, conflicts with the original meaning of the word firewall, which is a wall erected to stop a fire from spreading. Unconditionally.

detecting the os and distro on the system

Wednesday, January 6th, 2010

Linux, in all its diversity, sometimes lacks very basic mechanisms that you would be prepared to take for granted. For instance, picture this scenario.. you are on some system you don’t know well. Maybe it’s your friend who called you over to his house for help with something. Maybe it’s the computer lab where you don’t know the setup. Maybe it’s a remote session to a box you’ve never seen. Quick question: what’s it running?

Well, bash --version tells you about bash, ls --version tells you about coreutils and so on, you can keep going with that. uname will tell you about the platform and the kernel, useful stuff. What about the distro? Distros are diverse, sometimes it helps a lot to know what kind of environment you’re in. Well, strangely enough, there’s no standard answer to that.

They all do their own thing, if they do at all. Some distros use lsb_release to dispense this information. Others have files in /etc that you can check for, if you know what they are supposed to be called. So I decided to try and detect this. I’ve checked a bunch of livecds and it works on those distros that identify themselves somehow.

osdetect

# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU Public License, version 3
#
# <desc> Detect OS (platform and version) of local machine </desc>
#
# <usage>
# source this file in bash, then run `osdetect`
# </usage>
 
 
_concat() {
	local s="$1";shift;
	while [ "$1" ]; do
		s="$s $1"; shift
	done
	echo "$s" | sed "s|^[ \\t]*||g" | sed "s|[ \\t]*$||g"
}
 
_glob() {
	local file=
	local glob=
	local lst=
	while [ -z "$file" ] && [ "$1" ]; do
		glob="$1";shift;
		lst=$(ls $glob 2>/dev/null | grep -v /etc/lsb-release)
		if [ "$lst" ]; then
			file=$(echo "$lst" | head -n1)
		fi
	done
	echo "$file"
}
 
osdetect() {
	# ref: http://linuxmafia.com/faq/Admin/release-files.html
 
	local os=
	local release=
	local machine=
	if ! which uname &>/dev/null; then
		echo -e "${cred}No uname on system${creset}" >&2
		os="N/A"
	else
		os=$(uname -s)
		release=$(uname -r)
		machine=$(uname -m)
	fi
	if [ "$os" = "SunOS" ]; then
		os="Solaris"
		machine=$(uname -p)
	fi
	local platform="$(_concat "$os" "$release" "$machine")"
 
	# prefer lsb_release
	if which lsb_release &>/dev/null; then
		local id="$(_concat "$(lsb_release -i | sed "s|.*:||g")")"
		local rel="$(_concat "$(lsb_release -r | sed "s|.*:||g")")"
		local code="$(_concat "$(lsb_release -c | sed "s|.*:||g")")"
	elif [ -f /etc/lsb-release ]; then
		local id="$(_concat "$(grep DISTRIB_ID /etc/lsb-release | sed "s|.*=||g")")"
		local rel="$(_concat "$(grep DISTRIB_RELEASE /etc/lsb-release | sed "s|.*=||g")")"
		local code="$(_concat "$(grep DISTRIB_CODENAME /etc/lsb-release | sed "s|.*=||g")")"
 
	# find a file or another
	else
		local vfile=$(_glob "/etc/*-rel*" "/etc/*_ver*" "/etc/*-version")
		[ "$vfile" ] && local id=$(cat "$vfile")
 
		# distro specific
		[ "$vfile" = /etc/debian_version ] && [ "$id" ] && id="Debian $id"
	fi
 
	[ "$id" = "n/a" ] && id=
	[ "$rel" = "n/a" ] && rel=
	[ "$code" = "n/a" ] && code=
 
	local version="$(_concat "$id" "$rel" "$code")"
	[ ! -z "$version" ] && version=" ~ ${cyellow}$version${creset}"
 
	echo -e "Platform: ${ccyan}${platform}${creset}${version}"
}

Download this code: osdetect.sh

lazy function loading

Tuesday, December 8th, 2009

Even though bash is not my favorite programming language, I end up writing a bit of code in it. It’s just super practical to have something in bash if you can. I mentioned in the past how it’s a good idea to avoid unnecessary cpu/io while initializing the shell by doing deferred aliasing. That solves the alias problem, but I also have a bunch of bash code in terms of functions. So I was thinking the same principle could be applied again.

Let’s use findpkgs as an example here. The function is defined in a separate file and the file is source’d. But this means that every time I start a new shell session the whole function has to be read and loaded into memory. That might not be convenient if there are a number of those. networktest, for instance, defines four function and is considerably longer.

So let’s “compile to bytecode” again:

findpkgs ()
{
    [ -f ~/.myshell/findpkgs.sh ] && . ~/.myshell/findpkgs.sh;
    findpkgs $@
}

Download this code: lazyfuncionload_loadfunc

When the function is defined this way, the script actually hasn’t been sourced yet (and that’s precisely the idea), but it will be the minute we call the function. This, naturally, will rebind the name findpkgs, and then we call the function again, with the given arguments, but this time giving them to the actual function.

Okay, so that was easy. But what if you have a bunch of functions loaded like that? It’s gonna be kinda messy copy-pasting that boilerplate code over and over. So let’s not write that code, let’s generate it:

lazyimport() {
# generate code for lazy import of function
	function="$1";shift;
	script="$1";shift;
 
	dec="$function() {
		[ -f ~/.myshell/$script.sh ] && . ~/.myshell/$script.sh
		$function \\$@
	}"
	eval "$dec"
}

Download this code: lazyfuncionload_lazyimport

Don’t worry, it’s the same thing as before, just wrapped in quotes. And now we may import all the functions we want in the namespace by calling this import function with the name of the function we want to “byte compile” and the script where it is found:

## findpkgs.sh
 
lazyimport findpkgs findpkgs
 
## networktest.sh
 
lazyimport havenet networktest
lazyimport haveinet networktest
lazyimport wifi networktest
lazyimport wifiscan networktest
 
## servalive.sh
 
lazyimport servalive servalive

Download this code: lazyfuncionload_functions

So let’s imagine a hypothetical execution. It goes like this:

  1. Start a new bash shell.
  2. Source import.sh where all this code is.
  3. On each call to lazyimport a function declaration is generated, and eval’d. The function we want is now bound to its name in the shell.
  4. On the first call to the function, the generated code for the function is executed, which actually sources the script, which rebinds the name of the function to the actual code that belongs to it.
  5. The function is executed with arguments.
  6. On subsequent executions the function is already “compiled”, ie. bound to its proper code.

So what happens, you may wonder, in cases like the above with networktest, where several mock functions are generated, all of which will source the same script? Well nothing, whichever of them is called first will source the script and overwrite all the bindings, remember? It only takes one call to whichever function and all of them become rebound to the real thing. So all is well. :)