Archive for the ‘technology’ Category

re: for the man with many repos

Sunday, November 13th, 2011

As it often goes, re is a tool that grew out of a bunch of shell scripts. I kept adding stuff to the scripts for a long time, but eventually it went beyond the point of being manageable.

The tool addresses three different issues:

  • Cloning/pulling multiple repos in one step.
  • Keeping repo clones in sync across machines.
  • Better handling of local tracking branches.

Listing repos

Let’s start with a basic situation. I’ve cloned some of my repos on github:

$ ls -F
galleryforge/  italian-course/  re/  spiderfetch/

I run re list to scan the current path recursively and discover all the repos that exist:

$ re list
[galleryforge:git]
    origin.url = git@github.com:numerodix/galleryforge.git
[italian-course:git]
    origin.url = git@github.com:numerodix/italian-course.git
[re:git]
    origin.url = git@github.com:numerodix/re.git
[spiderfetch:git]
    origin.url = git@github.com:numerodix/spiderfetch.git
> Run with -u to update .reconfig

It creates a configuration file called .reconfig that contains the output you see there. By default it doesn’t overwrite the config, just shows you the result of the detection action. Pass -u to update it.

This file format is similar to .git/config. Every block is a repo, and :git is a tag saying “this is a git repo”. (By design re is vcs agnostic, but in practice I only ever use git and the only backend right now is for git. It probably smells a lot of git in any case.)

Every line inside a block represents a remote (git terminology). By default there is only one. If you add a remote in the repo and re-run re list it will detect it. But it will assume that origin is the canonical remote (more on why this matters later).
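
To make the format concrete, here is a minimal sketch of how such a file could be read. It is not re's actual parser; the function names parse_reconfig and canonical_remote are made up for illustration, and the only assumptions are the ones visible in the listing above: [path:vcs] headers, name.url = value lines, and the first remote in a block being the canonical one.

```python
def parse_reconfig(text):
    """Parse .reconfig-style text into {repo_path: {"vcs": ..., "remotes": {...}}}.

    Illustrative only: the real tool may treat the file differently.
    """
    repos = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and status lines like "> Run with -u ..."
        if line.startswith("[") and line.endswith("]"):
            # Header looks like "[path:vcs]"; split on the last colon.
            path, _, vcs = line[1:-1].rpartition(":")
            current = {"vcs": vcs, "remotes": {}}
            repos[path] = current
        elif current is not None and "=" in line:
            key, _, url = line.partition("=")
            key = key.strip()
            # "origin.url" -> remote name "origin"
            remote = key[:-4] if key.endswith(".url") else key
            current["remotes"][remote] = url.strip()
    return repos


def canonical_remote(repo):
    """Return the first remote listed in a block (dicts keep insertion order)."""
    return next(iter(repo["remotes"]), None)
```

Feeding it the output of re list from above would give one entry per repo, each with origin as its canonical remote.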

Pulling repos

Now let’s say I want to pull all those repos to sync them with github. I use (you guessed it) re pull:

$ re pull
> Fetching galleryforge
> Fetching italian-course
> Fetching re
> Fetching spiderfetch
> Merging galleryforge
> Merging italian-course
> Merging re
> Merging spiderfetch
-> Setting up local tracking branch ruby-legacy
-> Setting up local tracking branch sqlite-try
-> Setting up local tracking branch db-subclass
-> Setting up local tracking branch next

As you can see it does fetching and merging in separate steps. Fetching is where all the network traffic happens, merging is local, which is why I think it’s nice to separate them. (But there are more reasons to avoid git pull.)

What it also does is set up local tracking branches against the canonical remote. The canonical remote is the one listed first in .reconfig. So it doesn’t matter what it’s called, but it’s a good idea to make it origin, because that’s what re list will assume when you use it to update .reconfig after you add/remove repos.

It handles local tracking branches only against one remote, because if both origin and sourceforge have a branch called stable then it’s not clear which one of those the local branch stable is supposed to track. I find this convention quite handy, but your mileage may vary.

If I later remove the branch ruby-legacy from github and run re pull, it’s going to detect that I have a local tracking branch that is pointing at something that doesn’t exist anymore:

$ re pull spiderfetch
> Fetching spiderfetch
> Merging spiderfetch
-> Stale local tracking branch ruby-legacy, remove? [yN]
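
The bookkeeping behind those two messages ("Setting up" and "Stale") boils down to comparing two sets of branch names. The sketch below is an assumption about the logic, not code taken from re; reconcile_tracking is a hypothetical name, and how the two lists are gathered from git is left out.

```python
def reconcile_tracking(local_tracking, remote_branches):
    """Compare local tracking branches with the branches on the canonical remote.

    Returns (to_set_up, stale), each a sorted list of branch names.
    """
    local = set(local_tracking)
    remote = set(remote_branches)
    to_set_up = sorted(remote - local)  # on the remote, no local branch yet
    stale = sorted(local - remote)      # local branch whose upstream is gone
    return to_set_up, stale
```

Only one remote is consulted, which matches the convention described above: with two remotes the same branch name could mean two different things.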

Scaling beyond a single machine

Now, re helps you manage multiple repos, but it also helps you keep your repos synced across machines. .reconfig is a kind of spec for what you want your repo-hosting directory to contain, so you can just ship it to a different machine, re pull and it will clone all the repos over there, set up local tracking branches, all the same stuff.
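
Treating .reconfig as a spec means the decision per repo is simple: clone it if it isn't there, fetch it if it is. A rough sketch of that decision follows, assuming a plain {path: url} mapping as input. plan_sync is a hypothetical helper, not part of re, and it only builds the git command lines without running them.

```python
import os


def plan_sync(spec, root="."):
    """Return one git command (as an argv list) per repo in the spec.

    A repo that already has a .git directory gets fetched, anything else
    gets cloned. Nothing is executed here.
    """
    commands = []
    for path, url in spec.items():
        target = os.path.join(root, path)
        if os.path.isdir(os.path.join(target, ".git")):
            commands.append(["git", "-C", target, "fetch", "origin"])
        else:
            commands.append(["git", "clone", url, target])
    return commands
```

On a fresh machine every entry comes out as a clone; on a machine that already has some of the repos, only the missing ones do.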

In fact, why not keep .reconfig itself in a repo, which again you can push to a central location and from which you can pull onto all your machines:

$ re list
[.:git]
    origin.url = user@host:~/repohost.git
[galleryforge:git]
    origin.url = git@github.com:numerodix/galleryforge.git
[italian-course:git]
    origin.url = git@github.com:numerodix/italian-course.git
[re:git]
    origin.url = git@github.com:numerodix/re.git
[spiderfetch:git]
    origin.url = git@github.com:numerodix/spiderfetch.git
> Run with -u to update .reconfig

It does not manage .gitignore, so you have to do that yourself.

Advanced uses

Those are the basics of re, but the thing to realize is that it doesn’t limit you to a situation like the one we’ve seen in the examples so far, with a single directory that contains repos. You can have repos at any level of depth, you can have .reconfigs at different levels too, and you can then use a single re pull -r to recursively pull absolutely everything in one step.
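
As an illustration of what a recursive run has to discover first, this sketch walks a directory tree and collects every .reconfig it finds. It is a guess at the idea, not the way re pull -r is implemented; find_reconfigs is a made-up name, and skipping .git directories is my own assumption.

```python
import os


def find_reconfigs(root):
    """Yield the path of every .reconfig file at or below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune VCS metadata directories and walk the rest in a stable order.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        if ".reconfig" in filenames:
            yield os.path.join(dirpath, ".reconfig")
```

Each file found this way would then be handled like the single-directory case from the earlier examples.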

Get it from github:

full system encryption

Thursday, June 2nd, 2011

In the age of laptops I was thinking maybe it’s time I finally try encrypting my disk. I’ve never done it before, so before going for it I needed a small approfondimento.

The common strategy seems to be roughly:

  1. Leave /boot unencrypted.
  2. Encrypt the rest of the disk with LUKS. You then have dm-crypt that provides a mapping between the partition (according to the partition table) and the corresponding unencrypted block device, which becomes a node like /dev/mapper/nodename, depending on what you call it.
  3. Use /dev/mapper/nodename as the “physical” partition which you assign to lvm and make into a volume group.
  4. Create logical volumes in the volume group, so that each logical volume corresponds to what we used to call a partition on the old model, ie. /, /home, /var etc.

lvm is practical here, because you need at least two partitions, / and swap. You could just as well create multiple partitions sda5,sda6,… and encrypt each one, but then you’d have to unlock them individually on boot, which is hacky.

The setup is a bit involved and I would rather be spared the trouble of doing it manually. The ubuntu alternate install cd has a fully automatic feature that does this, using the whole disk. If you’re happy with the basic scheme, but you want more partitions or you want to size them differently, you can use the curses gui following a nice guide like this one.

So far so good, but now comes the inevitable question. Much like with compression you want to be able to not only compress but also to decompress. What if I screw up my boot sector or my fstab and I need to boot from a rescue cd? How do I mount /dev/sda5 now?

Mounting manually

First of all, in case we don’t have all the tools we need:

$ apt-get install lvm2
$ modprobe dm-mod

We obviously need to know what the physical partitions actually are:

$ fdisk -l /dev/sda
 
Disk /dev/sda: 32.2 GB, 32212254720 bytes
255 heads, 63 sectors/track, 3916 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000a30a5
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13       96256   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              13        3917    31357953    5  Extended
/dev/sda5              13        3917    31357952   83  Linux

/dev/sda1 is /boot, that’s easy. Then we have /dev/sda5, which is the encrypted partition and mounting it directly will not work. This is where dm-crypt comes into the picture: we are going to unlock the partition, obtaining a block device that represents the unencrypted view of the partition.

$ cryptsetup luksOpen /dev/sda5 vg0
$ cd /dev/mapper
$ ls -lh
crw-------    1 root     root       10, 236 Jun  2 16:44 control
lrwxrwxrwx    1 root     root           7 Jun  2 16:47 vg0 -> ../dm-0

We have decided to call the block device vg0, and it thus appears as /dev/mapper/vg0. Since we know this is an lvm volume group we use the lvm tools to figure out what it contains:

$ vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "vg0" using metadata type lvm2
$ vgs
  VG   #PV #LV #SN Attr   VSize  VFree
  vg0    1   2   0 wz--n- 29.90g    0
$ lvscan
  inactive          '/dev/vg0/swap' [1.86 GiB] inherit
  inactive          '/dev/vg0/root' [28.04 GiB] inherit

We have two logical volumes in there. But they are inactive, which means they are not visible under /dev, ie. we can’t mount them. To make them active:

$ vgchange -a y
  2 logical volume(s) in volume group "vg0" now active

The device names listed before are now visible and we can mount them:

$ cd /dev/vg0
$ ls -lh
lrwxrwxrwx 1 root root 7 Jun  2 18:24 root -> ../dm-2
lrwxrwxrwx 1 root root 7 Jun  2 18:24 swap -> ../dm-1
$ swapon /dev/vg0/swap
$ mount /dev/vg0/root /mnt
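
To recap the order of operations, here is the whole sequence collected into one small helper. It only assembles the commands used above and runs nothing; rescue_commands is a hypothetical name, and the logical volume names swap and root are simply the ones from this particular install, so yours may differ.

```python
def rescue_commands(partition, mapping, volume_group, mountpoint="/mnt"):
    """Return the unlock-and-mount steps as argv lists, in the order they must run."""
    return [
        ["modprobe", "dm-mod"],                               # device mapper support
        ["cryptsetup", "luksOpen", partition, mapping],       # unlock -> /dev/mapper/<mapping>
        ["vgscan"],                                           # find the volume group inside
        ["vgchange", "-a", "y"],                              # activate its logical volumes
        ["swapon", "/dev/%s/swap" % volume_group],
        ["mount", "/dev/%s/root" % volume_group, mountpoint],
    ]


if __name__ == "__main__":
    for argv in rescue_commands("/dev/sda5", "vg0", "vg0"):
        print(" ".join(argv))
```

The order matters: the logical volumes cannot be activated before the partition is unlocked, and they cannot be mounted before they are activated.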

ironpython + gtk

Saturday, April 9th, 2011

[Image: nametrans_about_box]

I develop a bunch of tools for my own use that usually run in the terminal, for the simple reason that this maximizes their usability to me. It’s also true that usually a terminal program takes much less code and effort to write than a gui program.

But occasionally I want to make something of mine usable to the so-called “end users” that we keep hearing about. So what I was wondering about was: What would it be like to add a gui to an existing python program and run it on .NET?

How do you find out? You try it. This is something that I’ve been meaning to try for a while anyway.

Pros:

  • 2.3mb download

Cons:

  • gtk# dependency
  • sluggish gui

Gtk

This is the first time I’ve used gtk. I’ve written one program in the past using winforms and it was quite painful, so I wanted to try gtk.

To build the gui I naturally use glade, which is the best gui design experience ever. I then have to find out how to connect the signals, write the handlers and all that stuff.

Mono’s api docs on gtk seem quite complete, so you can usually find out what you need to know. I seem to remember looking at these in the past and seeing nothing but “Documentation for this section has not yet been entered”, so perhaps they have improved this recently. I have to say they do use the beloved javadoc model of having the left frame list all the classes, which makes navigation a pain. Api doc presentation is not really known to be a cutting edge field, it really could use work. Then again, the presentation of the docs can’t help but reflect the organization of the api to a large extent and if your api is organized as a kingdom of nouns rather than thematic modules then there’s probably little magic the api doc guy can wield.

Gtk is not as discoverable as it could be, due to the fact that they have invented some terminology. A window resize event is not a ResizeEvent, it’s an ExposeEvent. When I tried to discover this one I tried connecting all the signals that mention resizing in the api docs and none of them worked. Maybe this one belongs to a base class that wasn’t listed in the api doc I was looking at? Could be. (You see how api docs matter?)

In general though, I get the impression that the more you know about gtk the more it makes sense, it seems a really well engineered toolkit. This, of course, in stark contrast to winforms, where the more you know about it the worse it gets. I don’t know what gtk is like in c, but at least under .NET it doesn’t have the manual memory management “feature” of winforms.

Gtk is hugely superior in many ways and that I don’t think is even debatable. The layouting is great, internationalization support is painless, pango rendering is lovely and lets you use markup etc. One advantage that winforms has is the way it appears under Windows. But this I think is more a theming issue than a toolkit issue. Winforms looks nice under Windows and gtk looks mediocre. On the other hand, if I boot Mono’s livecd with the latest mono on it, and I run the samples there, gtk looks lovely and winforms looks atrocious, even though this is ostensibly supposed to demonstrate how well both of them run on the latest mono. So I take it from this that simply not enough people care about gtk on Windows to make it look better.

For the python coder, you can apply the api docs for c# directly, because the binding exposes the same names to python. This is very helpful. There are also quite a few pygtk examples around the web you can find if you need to discover how to do something in gtk.

IronPython

Let’s start with the positives. If you give it an existing python program and tell it where the standard library is, it will run your program. In this case I had to work around some platform nastiness and patch some code that’s using os (throwing false positive OSErrors on file renames) that works fine on CPython/Windows but doesn’t on IronPython/Windows. But in general it seems to work quite well.

Development experience

But if all you’re going to do is write code using the standard library, there is absolutely no reason to use IronPython. How do you access .NET apis from Python? No, seriously, that’s not a rhetorical question. IronPython ships with a number of assemblies, so where are the api docs for these? Are they browsable on the web? Are they viewable locally? The only thing I can see documentation wise is, next to IronPython.dll, something called IronPython.xml. It seems to be an aggregate of all the inline api comments in the source code. Is this it? Is this how you’re supposed to view the api doc? IronPython 2.7 does something interesting. It ships a file called IronPython.chm, but only the .msi installer package includes this. As I’m sure the IronPython guys are well aware, the installer is Windows only and the .chm file too is Windows only (although I think there is something called kchmviewer that could read it).

In fact, the .chm file is basically the original Python 2.7 documentation (written in .rst, which compiles to html), renamed to IronPython 2.7, and includes a few additional pages about IronPython. Except that instead of throwing the html files on the web or including them in the download, they have compiled them into a .chm which they only give to the Windows people. And this is supposed to be a platform independent runtime? Is this a prank?

IronPython ships some examples, but it’s not enough to figure out how to use the apis. To write a veritable IronPython gui program you need to produce an assembly, which has to be compiled with -target:winexe so that it does not show a terminal window on Windows. The simplest way seems to be to write a wrapper in c#, using the runtime hosting api, and execute your python in there. That’s what the Pyjama project uses (they also use gtk#), so I was able to look that up. But when I needed to set up sys.argv and sys.version using the hosting apis before launching the python I had no idea how. How are you supposed to find this out without documentation?

It turns out that you need to do something like this:

ScriptScope scope = Python.ImportModule(runtime, "sys");
scope.SetVariable("version", "ironpython");

Easy, right? Yeah, when you know it, of course it is. But what if you didn’t? How would you know that there is a function ImportModule in the Python namespace?

And how do you set sys.argv?

IronPython.Runtime.List lst = (IronPython.Runtime.List) scope.GetVariable("argv");

How would you know that IronPython.Runtime.List is the class that models a python list? You can use it like a python list:

lst.append("program.py");

In the end your best bet might actually be dir(). So how do you find out what’s in the Python namespace when coding c#? Use python:

import clr
clr.AddReference("IronPython")
import IronPython
print dir(IronPython.Hosting.Python)

You could actually use this method to recursively dir() the assemblies and produce a map of what’s in there, generating some kind of api doc in the end, but this is getting quite out of hand now.
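
For what it’s worth, the recursive dir() idea fits in a few lines. The sketch below is written to run on plain CPython 3 against ordinary modules and classes; whether inspect classifies .NET namespaces and types the same way under IronPython is something I have not verified, so treat that part as an assumption. api_map is a made-up name.

```python
import inspect


def api_map(obj, name, max_depth=3, _seen=None):
    """Return a sorted list of dotted names reachable from obj via public attributes."""
    seen = _seen if _seen is not None else set()
    names = []
    if max_depth < 0 or id(obj) in seen:
        return names
    seen.add(id(obj))  # guard against cycles
    for attr in dir(obj):
        if attr.startswith("_"):
            continue  # skip private and dunder names
        try:
            value = getattr(obj, attr)
        except Exception:
            continue  # some attributes raise on access
        dotted = "%s.%s" % (name, attr)
        names.append(dotted)
        # Only modules and classes can contain further names worth listing.
        if inspect.ismodule(value) or inspect.isclass(value):
            names.extend(api_map(value, dotted, max_depth - 1, seen))
    return sorted(names)
```

Calling api_map(some_module, "some_module") gives a flat list of names that you can at least grep, which is about as much api doc as this approach will produce.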

At this point I was going to mention msdn and how that ought to host all the .NET related api docs you could ever want, but I see that the site is even worse than it used to be and I don’t have the palest idea of where to find anything anymore. I can only imagine that if you install the latest Visual Studio Ultimate Premium Professional it will enhance your hard drive with untold gigabytes of xml compiled to a proprietary binary file format worth of api docs, which you can only browse in the proprietary api doc reader, which internally just renders html anyway. But actually, since IronPython was ousted by Microsoft and since it has the stigma of being an open source project, it could well be that it would be considered too tainted to include its api docs in the Visual Studio-installed doc browser.

As your final recourse, I suggest you warm up your grep, git clone this IronLanguages repo and hope for the best.

Debugging

When I try this somewhere in my hosted python program:

v = None + 1

this is what I get:

Unhandled Exception: Microsoft.Scripting.ArgumentTypeException: unsupported operand type(s) for +: 'NoneType' and 'int'

And the rest of the stack trace doesn’t mention the python code at all. Better than “Segmentation fault”, yes, but not by a whole lot. No filename, no line number.

This being the case I would strongly recommend developing your python with CPython first and then, once it’s in good shape, build the gui so you can do most of your debugging under CPython.

Upgrade cycle

I mentioned IronPython 2.7 earlier, but funny thing, I’ve never actually tried it. The 2.7 release is compiled against .NET 4.0, which is not supported by the mono packages in Ubuntu. I thought that since I have everything working with .NET 2.0/IronPython 2.6 I could just throw in the newer assemblies, run my makefile and compile it against .NET 4.0, on a mono release recent enough to support 4.0 (such as the mono livecd I mentioned before). But no deal. And if you just install .NET 4.0 under Windows it doesn’t come with a compiler or anything, so there’s no way to try it.

IronPython + Gtk

So how responsive is an ironpython/gtk application, even a tiny one like this?

Start up speed on my Ubuntu system is something like 4s, which is just about fast enough not to notice that it runs on IronPython instead of CPython. But Ubuntu uses gtk natively, so the libraries are in memory. On Windows it varies from a warm start of 6s to 20s+, for a cold start. In the worst case the first hello message from python appears after maybe 5s, so the rest must be accounted for by loading gtk.

Once it’s running, it can be a bit sluggish, in particular halting the program on Ubuntu sometimes seems to freeze the gui for a few seconds before it goes away. On Windows the issue (as usual) seems to be io. This program scans the filesystem whenever the input parameters change, which produces a list of files that is eventually displayed in the gui. But I’m not convinced that it’s any slower than CPython/Windows.

nametrans: renaming with search/replace

Friday, March 25th, 2011

Keeping filenames properly organized is a pain when all you have available for the job is renaming files one by one. It’s most disheartening when there is something you have to do to all the files in the current directory. This is where a method of renaming by search and replace, just as in a text document, would help immensely. Something like this perhaps:

[Image: nametrans_ss]

Simple substitutions

The simplest use is just a straight search and replace. All the files in the current directory will be tried to see if they match the search string.

$ nametrans.py "apple" "orange"
 * I like apple.jpg    -> I like orange.jpg
 * pineapple.jpg       -> pineorange.jpg
 * The best apples.jpg -> The best oranges.jpg

There are also a number of options that simplify common tasks. Options can be combined and the order in which they are set does not matter.

Ignore case

Matching against strings with different case is easy.

$ nametrans.py -i "pine" "wood"
 * pineapple.jpg -> woodapple.jpg
 * Pinetree.jpg  -> woodtree.jpg

Literal

The search string is actually a regular expression. If you use characters that have a special meaning in regular expressions then set the literal option and it will do a standard search and replace. (If you don’t know what regular expressions are, just use this option always and you’ll be fine.)

$ nametrans.py --lit "(1)" "1"
 * funny picture (1).jpg -> funny picture 1.jpg

Root

If you prefer the spelling “oranje” instead of “orange” you can replace the G with a J. This will also match the extension “.jpg”, however. So in a case like this set the root option to consider only the root of the filename for matching.

$ nametrans.py --root "g" "j"
 * I like orange.jpg    -> I like oranje.jpg
 * pineorange.jpg       -> pineoranje.jpg
 * The best oranges.jpg -> The best oranjes.jpg
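
The four substitution modes shown so far (plain, ignore case, literal, root) can be sketched as one small function. This is not the nametrans source, just a minimal approximation using the standard re module; the function name transform and its keyword arguments are invented for the example.

```python
import os
import re


def transform(name, search, replace, ignore_case=False, literal=False, root=False):
    """Apply one search/replace to a filename and return the new name."""
    # Literal mode: escape the pattern so "(1)" matches the characters ( 1 ).
    pattern = re.escape(search) if literal else search
    # In literal mode the replacement is also taken verbatim (no backreferences).
    repl = (lambda match: replace) if literal else replace
    flags = re.IGNORECASE if ignore_case else 0
    # Root mode: keep the extension out of reach of the substitution.
    stem, ext = os.path.splitext(name) if root else (name, "")
    return re.sub(pattern, repl, stem, flags=flags) + ext
```

Without root=True, replacing "g" with "j" in "I like orange.jpg" would also hit the extension and produce "I like oranje.jpj", which is exactly the case the root option exists for.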

Hygienic uses

Apart from specific transforms, there are some general options for keeping filenames consistent that apply to many scenarios.

Neat

The neat option tries to make filenames neater by capitalizing words and removing characters that are typically junk. It also does some simple sanity checks like removing spaces or underscores at the ends of the name.

$ nametrans.py --neat
 * _funny___picture_(1).jpg -> Funny - Picture (1).jpg
 * i like apple.jpg         -> I Like Apple.jpg
 * i like peach.jpg         -> I Like Peach.jpg
 * pineapple.jpg            -> Pineapple.jpg
 * the best apples.jpg      -> The Best Apples.jpg

Lower

If you prefer lowercase, here is the option for you.

$ nametrans.py --lower
 * Funny - Picture (1).jpg -> funny - picture (1).jpg
 * I Like Apple.jpg        -> i like apple.jpg
 * I Like Peach.JPG        -> i like peach.jpg
 * Pineapple.jpg           -> pineapple.jpg
 * The Best Apples.jpg     -> the best apples.jpg

If you want the result of neat and then lowercase, just set them both. (If you like underscores instead of spaces, also set --under.)

Non-flat uses

Presuming the files are named consistently you can throw them into separate directories by changing some character into the path separator.

Note: On Windows, the path separator is \ and you may have to write it as “\\\\”.

$ nametrans.py " - " "/"
 * france - nice - seaside.jpg -> france/nice/seaside.jpg
 * italy - rome.jpg            -> italy/rome.jpg

The inverse operation is to flatten the entire directory tree so that all the files are put in the current directory. The empty directories are removed.

$ nametrans.py --flatten
 * france/nice/seaside.jpg -> france - nice - seaside.jpg
 * italy/rome.jpg          -> italy - rome.jpg
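
Seen purely as string operations, the two directions are inverses of each other. A minimal sketch follows, assuming " - " as the separator and forward slashes in paths; to_dirs and flatten are illustrative names, and the real tool also moves the files and removes the emptied directories, which this does not attempt.

```python
from pathlib import PurePosixPath


def to_dirs(name, sep=" - "):
    """Turn "a - b - c.jpg" into the relative path "a/b/c.jpg"."""
    return str(PurePosixPath(*name.split(sep)))


def flatten(path, sep=" - "):
    """Turn the relative path "a/b/c.jpg" back into "a - b - c.jpg"."""
    return sep.join(PurePosixPath(path).parts)
```

Round-tripping only works if the separator does not already occur inside a directory or file name.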

In general, the recursive option will take all files found recursively and make them available for substitutions. It can be combined with other options to do the same thing recursively as would otherwise happen in a single directory.

$ nametrans.py -r --neat
 * france/nice/seaside.jpg -> France/Nice/Seaside.jpg
 * italy/rome.jpg          -> Italy/Rome.jpg

In recursive mode the whole path will be matched against. You can make sure the matching only happens against the file part of the path with --files or only the directory part with --dirs.

Special uses

Directory name

Sometimes filenames carry no useful information and serve only to maintain them in a specific order. The typical case is pictures from your camera that have meaningless sequential names, often with gaps in the sequence where you have deleted some pictures that didn’t turn out well. In this case you might want to just use the name of the directory to rename all the files sequentially.

$ nametrans.py -r --dirname
 * rome/DSC00001.jpg -> rome/rome 1.jpg
 * rome/DSC00007.jpg -> rome/rome 2.jpg
 * rome/DSC00037.jpg -> rome/rome 3.jpg
 * rome/DSC00039.jpg -> rome/rome 4.jpg
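
The renaming rule in that listing can be approximated as: group files by directory, sort them, and number them from 1 using the directory name as the prefix. The sketch below assumes forward-slash paths and alphabetical sort order; dirname_rename is a hypothetical name and the real option may order or format things differently.

```python
import posixpath
from collections import defaultdict


def dirname_rename(paths):
    """Return (old, new) pairs that rename files to "<dirname> <n><ext>"."""
    by_dir = defaultdict(list)
    for path in paths:
        by_dir[posixpath.dirname(path)].append(path)
    renames = []
    for directory, members in sorted(by_dir.items()):
        label = posixpath.basename(directory)
        # Sorting keeps the original sequence; the gaps in it simply disappear.
        for index, old in enumerate(sorted(members), start=1):
            ext = posixpath.splitext(old)[1]
            new = posixpath.join(directory, "%s %d%s" % (label, index, ext))
            renames.append((old, new))
    return renames
```

Because only the position in the sorted list matters, the gaps left by deleted pictures disappear from the new names.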

Rename sequentially

Still in the area of sequential names, at times the numbers have either too few leading zeros to be sorted correctly or too many unnecessary zeros. With this option you can specify how many leading zeros you want (and if you don’t say how many, it will find out on its own). Based on an old piece of code that has been integrated.

$ nametrans.py -r --renseq 1:3
 * rome/1.jpg   -> rome/001.jpg
 * rome/7.jpg   -> rome/007.jpg
 * rome/14.jpg  -> rome/014.jpg
 * rome/18.jpg  -> rome/018.jpg
 * rome/123.jpg -> rome/123.jpg

The argument required here means field:width, so in a name like:

series14_angle3_shot045.jpg

the number 045 can be shortened to 45 with “3:2” (third field from the beginning) or “-1:2” (first field from the end).
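
Here is a minimal sketch of the field:width rule, assuming a “field” is a run of digits in the name without its extension. It is not the integrated code mentioned above; renseq here is just an illustrative function, it takes a bare filename rather than a path, and the automatic width detection is left out.

```python
import os
import re


def renseq(name, field, width):
    """Re-pad the numeric field with the given 1-based index (negative counts from the end)."""
    stem, ext = os.path.splitext(name)
    runs = list(re.finditer(r"\d+", stem))
    if field == 0 or abs(field) > len(runs):
        return name  # no such field; leave the name alone
    match = runs[field - 1] if field > 0 else runs[field]
    # Drop existing leading zeros, then pad to the requested width.
    number = str(int(match.group())).zfill(width)
    return stem[:match.start()] + number + stem[match.end():] + ext
```

A number that is already wider than the requested width is left as it is, which is why 123.jpg stays 123.jpg with 1:3.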

Get it from sourceforge:

everything that is wrong with bookmarks

Tuesday, February 15th, 2011

The history of bookmarks is one of those tragic stories in technology. When bookmarks were first introduced (by Netscape? or maybe it was Mosaic?) they were a huge step forward. Trying to memorize urls or writing them on paper clearly weren’t methods that worked well. The idea — and so simple too — that the browser could remember the urls for you was the perfect solution.

Sadly, since the “big bang” of bookmarks there have been precious few new explosions.

The basic problem

The introduction of bookmarks, welcome as it was, created a problem that remains with us today. Once you start bookmarking pages, you inevitably produce a list of bookmarks that becomes more chaotic and less useful the longer it gets. Sure, a list of bookmarks is useful when you can look at it and quickly know what is there, and when you can see the bookmark you want to load right now.

But when you start having to scroll the list, and not only that but use the PageUp/PageDown keys to scroll the list quicker, it’s a good sign that it’s getting out of hand.

A collection of bookmarks is all well and good, but it needs some kind of structure superimposed on it to remain effective.

The bookmark toolbar

The bookmark toolbar encapsulates the insight that some bookmarks are more important than others and offers a number of improvements:

  1. Allows marking some bookmarks as more important/more frequently used.
  2. Gives them better visibility.
  3. Provides quicker access to them (by not having to go into the bookmark menu).

The bookmarks are displayed on a toolbar, either as links or as links-within-folders.

[Image: badbookmarks-toolbar]

Despite how useful this feature is, browsers have historically treated it as something of a marginal feature. Firefox, for instance, used to view the launcher toolbar as just another folder in the bookmarks collection (albeit with a special name, like “Personal Toolbar Folder”), which you could accidentally rename or delete, and then it wouldn’t show up as a toolbar anymore.

Another thing that matters a lot to the usability of the toolbar is the drag-and-dropability of bookmarks onto the toolbar, into the folders, and from one place to another. Even today, for instance, in Google Chrome I can’t reorder the items in a folder on the toolbar without opening the Bookmark Manager.

Import/export (and the silo)

It hardly needs stating that once you have a bookmark collection in one browser, you don’t want to manually recreate it if you decide to use another one. Browsers have historically been reluctant about giving out their bookmarks. All too often, despite making a show of offering to import your bookmarks from another browser, the import mechanism has bordered on the useless.

First and foremost, every browser vendor since the Ice Age has been eager to supply you with a tasty selection of bookmarks that he was convinced you would love. Importing your own bookmarks, therefore, could at best be seen as a supplement. No browser would ever just take your existing bookmarks and overwrite its own vendor-supplied ones, which is exactly what the user wants. Instead, it would stash them somewhere in the bookmark collection, well out of sight. Any additional metadata that was implicitly stored in your bookmarks would often be lost, like the order in which they were listed.

In particular, the browser would make no attempt to find out whether your imported bookmarks included a bookmark toolbar and use it in place of its own (despite the browser having a toolbar feature that worked exactly the same).

[Image: badbookmarks-ff-toolbar]

[Image: badbookmarks-bad-import]

So having done an “import”, you would typically have to manually organize your bookmarks, nuke the stupid vendor bookmarks, and sometimes you’d even have to recreate the folder structure of your bookmark toolbar, all before you had been able to achieve the same state as in your other browser.

This kind of situation is standard silo behavior. By making the import feature so mediocre, the browser vendor would pretty much ensure that the user would not switch browsers without paying a high price for it. Simply using more than one browser on a daily basis, with an easy way to manage your bookmarks across them by a quick sync, is just not realistic.

Bookmark sync (and more silo)

A way of keeping your bookmarks synced across computers has been a no-brainer feature since the era when people started accessing the web both at home and at school/work. And yet, a working synchronization feature is a pretty recent development in bookmarks. I recall some failed attempts with Firefox extensions in the remote past, but at last it is here.

A number of browsers have a sync feature now, and it’s a big step forward in bookmarks. Even if your bookmark collection is a mess, you can at least have the same mess all over the place. Clean it up and it’s clean everywhere.

And yet, bookmark sync is yet more silo behavior: you can sync your bookmarks from Opera to Opera, but not from Opera to Firefox. The fact that bookmark sync doesn’t do the same half-assed job as the import feature might seem strange, but the motive is very obvious:

  • ability to have your bookmarks up to date in our browser on every computer = good for the vendor
  • ability to have your bookmarks up to date in every browser = bad for the vendor

Browser vendors know very well that if bookmark sync worked as poorly as bookmark import, they couldn’t sell it as a feature, because no one would use it.

Bad page titles

Strangely enough, I’ve come all this way without mentioning just about the most glaring problem that bookmarks have: bad page titles. Since the name of the bookmark is simply the title of the page in 99.99% of the cases, the title ought to be both descriptive and concise. Instead, we have historically seen that web creators much prefer titles that are variations on this theme:

The Excessively Long Title Of My Website Which Is Very Nice Indeed: Section Title: Article Title

With titles like that, all too often you can’t even see the title of the article in your bookmark list, because the text is truncated somewhere in the middle.

Quite apart from the length problem, web sites often prefer to give articles catchy titles rather than descriptive ones. So with a title like:

Something Amusing That Makes You Think About What The Page Really Is About

you have the short term benefit of being amused at the cost of the long term benefit of a descriptive title.

Bad metadata

Bookmarks belong eminently to the category of things where the number of items is so large that it would be great to have a way of automating the retrieval/organization of the items.

Yet, despite announcements from Mozilla in the past that they would soon obliterate the old model of bookmarks-as-a-list, and introduce a new and all-conquering search based approach, we still have the list. The fact is that bookmarks don’t contain enough metadata to make search useful. A bookmark has two pieces of data:

  1. The name of the bookmark.
  2. The url.

Sure, some browsers give you the option to store other things too, like tags, but if we all agree that the user can’t be bothered to keep their bookmarks organized, let’s not pretend he will actually input any of the optional stuff. And even if he does, 90% of his bookmarks won’t have any other data associated with them, so we’re back to the short list above.

So why doesn’t search make sense? Because much too often neither the title nor the url contains any of the keywords that you would want to use in order to find this bookmark. Web sites don’t pay too much attention to titles, and the real data that would be useful to search is the page itself, which is not available.

Bookmark oblivion

What should seem ironic is that bookmarking a page often has the effect of not bookmarking it at all. The bookmark is saved somewhere in the long list and then never seen again, either because the list is too long to really bother looking at beyond the most recently added items, or because the page title is useless, or because it was bookmarked “for future reference”, and by the time we return to this topic we’ve forgotten about the bookmark.

We tend to grow pretty oblivious as to what’s in our bookmarks. Over time, some pages expire, others drift out of our sphere of interest, yet the bookmark collection doesn’t get updated.

Just about the most obvious feature a browser might offer is to try loading the bookmarks from time to time, in the background, and marking the ones that return 404.
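
To show how little such a feature needs, here is a rough sketch. The checker takes the status lookup as a parameter so it can be tried without touching the network; find_dead and http_status are made-up names, and a real browser would of course do this in the background and throttle it.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it can't be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404
    except (urllib.error.URLError, OSError):
        return None  # dns failure, refused connection, timeout...


def find_dead(bookmarks, fetch_status=http_status):
    """Return the (title, url) bookmarks whose page answers with 404."""
    return [(title, url) for title, url in bookmarks if fetch_status(url) == 404]
```

Pages that merely fail to load (status None) are deliberately not reported, since a site being down for a day is not the same thing as a page being gone.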

Another idea might be to offer to list bookmarks according to how often they are loaded, making the never used ones fall to the bottom of the list. Applying this to the bookmark folder might be especially useful, so the user doesn’t have to reorder the bookmarks to make the frequent ones quicker to reach.
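
The ordering itself is a one-liner once the browser keeps a visit count per bookmark. A sketch, assuming the counts are available as a {url: count} mapping (by_usage is a hypothetical name):

```python
def by_usage(bookmarks, visit_counts):
    """Return bookmark urls ordered from most to least visited.

    Bookmarks with equal counts keep their existing relative order,
    and never-visited ones end up at the bottom.
    """
    return sorted(bookmarks, key=lambda url: visit_counts.get(url, 0), reverse=True)
```

Since the sort is stable, whatever manual ordering the user did still decides between bookmarks that are used equally often.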