Archive for the ‘code’ Category

adventures in project renovation

March 9th, 2014

I'm inspired by how many great Python libraries there are these days, and how easy it is to use them. requests is the canonical example, and marks a real watershed moment, but there are many others.

It made me think back on various projects that I've published over the years and not touched in ages. I've been considering them more or less "complete". My standards for publishing projects used to be: write a blog entry, include the code, done. That was okay for simple scripts. Later on I started putting code on berlios.de and sourceforge.net. At some point github emerged and became the de facto standard, so I started using that too.

Fast forward to 2014 and the infrastructure available to open source projects has been greatly enriched. And with it, the standards for what makes a decent project have evolved. Jeff Knupp wrote a fabulous guide on this.

I decided to pick a simple case study. ansicolor is a single module whose origins I can trace back to 2008. I've seen the core functionality present in any number of codebases, because it's just so easy to hammer out some code for this and call it a day. But I never found it in a reusable form, so I decided to make it a separate thing that I could at least reuse between my own projects.

These are the steps a project is destined to pass through:

  • python3 support
  • pypi package + wheel! (a minimal packaging sketch follows this list)
  • readme that covers installation and "getting started"
  • tests + tox config
  • travis-ci hook
  • flake8 integration and fixing style violations
  • docs + Read the Docs hook
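
To make the packaging step concrete, here is roughly what a minimal setup.py for a single-module project can look like. This is a generic sketch, not ansicolor's actual setup.py; the version, description and classifiers are placeholders.

from setuptools import setup

setup(
    name="ansicolor",
    version="0.1.0",               # placeholder version
    py_modules=["ansicolor"],      # a single module, no package directory
    description="ANSI colors for terminal output",
    classifiers=[
        "Programming Language :: Python :: 2",
        "Programming Language :: Python :: 3",
    ],
)

With the wheel package installed, python setup.py sdist bdist_wheel then produces both a source distribution and a wheel ready to upload to PyPI.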

Not a single feature was added to ansicolor, not a single API was changed. Only two things really changed at the level of the code: exports were tidied up and docstrings were added. Python3 support was added too, but it was so trivial you'd have to squint to notice it.

The biggest stumbling block was actually writing the docs. As an implementor you tend to look at code in a completely different light than you do as a user of that code. Before starting on this I was thinking about how the API is a bit awkward in some places and could be improved. And how some of the functionality caters to a very narrow use case and maybe should be removed or moved to a "contrib"-like place.

But as a potential user of a library that I just discovered I don't care about any of that. I want to be able to "pip install" it. I want to have some quickstart documentation so I can have running code in 2 minutes. That's how long I'll typically spend deciding whether this code is worth my time at all, so if the implementor is busy polishing the API before even putting out a pypi package they're wasting their time.

There is an interesting cognitive dissonance at play here. As an implementor I tend to think that the darkest corners of my code are those that most need documenting. Those are the ones most likely to bite someone. The easy stuff anyone can figure out. But as a user that's not how I see it at all. It's precisely the simplest functionality that most needs explaining, because most users have simple needs. If you do a good job documenting that you can make lots of people productive. By contrast, the complicated features have a small audience. An audience that's more sophisticated and more likely to help themselves by reading the code if need be.

Then there are the tools. I always found sphinx a bit fiddly. It's not really obvious how to get what you want, and it's permissive enough not to complain, so it takes a fair bit of doc hunting to discover how other projects do it. PyPI has a more conservative rst parser than github, so if you feed it syntax it doesn't accept, it renders your page as plain text. I ended up doing a number of releases where only the readme changed slightly, just to debug this. Read the Docs works well, but I couldn't figure out how to make it build from a development branch. It seems to only want to build from a tag regardless of the branches you select, so that too inflated the number of releases.
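
One way to avoid burning releases on readme fixes is to run the rst through docutils locally and make warnings fatal. A small sketch; PyPI's parser settings are stricter than github's and may still differ from this:

from docutils.core import publish_string

# Render README.rst locally; halt_level=2 turns rst warnings into
# exceptions instead of silently producing broken output.
with open("README.rst") as f:
    publish_string(f.read(), writer_name="html",
                   settings_overrides={"halt_level": 2})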

It takes a bit of time to renovate a project, but it's all fairly painless. All these tools have reached a level of maturity that makes them very nice to use.

re: for the man with many repos

November 13th, 2011

As it often goes, re is a tool that grew out of a bunch of shell scripts. I kept adding stuff to the scripts for a long time, but eventually it went beyond the point of being manageable.

The tool addresses three different issues:

  • Cloning/pulling multiple repos in one step.
  • Keeping repo clones in sync across machines.
  • Better handling of local tracking branches.

Listing repos

Let's start with a basic situation. I've cloned some of my repos on github:

$ ls -F
galleryforge/  italian-course/  re/  spiderfetch/

I run re list to scan the current path recursively and discover all the repos that exist:

$ re list                                                                                
[galleryforge:git]
    origin.url = git@github.com:numerodix/galleryforge.git
[italian-course:git]
    origin.url = git@github.com:numerodix/italian-course.git
[re:git]
    origin.url = git@github.com:numerodix/re.git
[spiderfetch:git]
    origin.url = git@github.com:numerodix/spiderfetch.git
> Run with -u to update .reconfig

It creates a configuration file called .reconfig that contains the output you see there. By default it doesn't overwrite the config, just shows you the result of the detection action. Pass -u to update it.

This file format is similar to .git/config. Every block is a repo, and :git is a tag saying "this is a git repo". (By design re is vcs agnostic, but in practice I only ever use git and the only backend right now is for git. It probably smells a lot like git in any case.)

Every line inside a block represents a remote (git terminology). By default there is only one. If you add a remote to the repo and re-run re list, it will detect it. But it will assume that origin is the canonical remote (more on why this matters later).
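
To make the discovery step concrete, here is a rough sketch of the idea behind re list: walk the tree, treat every directory that contains .git as a repo, and record its remote. This is just an illustration, not re's actual code, and it only reads the origin remote:

import os
import subprocess

def find_repos(root="."):
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")   # don't descend into .git itself
            url = subprocess.check_output(
                ["git", "config", "--get", "remote.origin.url"],
                cwd=dirpath)
            yield dirpath, url.decode().strip()

for path, url in sorted(find_repos()):
    print("[%s:git]" % path)
    print("    origin.url = %s" % url)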

Pulling repos

Now let's say I want to pull all those repos to sync them with github. I use (you guessed it) re pull:

$ re pull                                                                                
> Fetching galleryforge
> Fetching italian-course                                                                
> Fetching re                                                                            
> Fetching spiderfetch                                                                   
> Merging galleryforge                                                                   
> Merging italian-course                                                                 
> Merging re                                                                             
> Merging spiderfetch                                                                    
-> Setting up local tracking branch ruby-legacy                                          
-> Setting up local tracking branch sqlite-try                                           
-> Setting up local tracking branch db-subclass                                          
-> Setting up local tracking branch next

As you can see it does fetching and merging in separate steps. Fetching is where all the network traffic happens, merging is local, which is why I think it's nice to separate them. (But there are more reasons to avoid git pull.)

What it also does is set up local tracking branches against the canonical remote. The canonical remote is the one listed first in .reconfig. So it doesn't matter what it's called, but it's a good idea to make it origin, because that's what re list will assume when you use it to update .reconfig after you add/remove repos.

It handles local tracking branches only against one remote, because if both origin and sourceforge have a branch called stable then it's not clear which one of those the local branch stable is supposed to track. I find this convention quite handy, but your mileage may vary.
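
The tracking branch logic amounts to something like the following sketch (again an illustration, not re's actual code): list the canonical remote's branches and create a local tracking branch for any name that doesn't exist locally yet.

import subprocess

def branches(ref_pattern):
    out = subprocess.check_output(
        ["git", "for-each-ref", ref_pattern, "--format=%(refname:short)"])
    return out.decode().split()

local = set(branches("refs/heads"))
for ref in branches("refs/remotes/origin"):
    name = ref.split("/", 1)[1]              # "origin/stable" -> "stable"
    if name != "HEAD" and name not in local:
        # -> Setting up local tracking branch <name>
        subprocess.check_call(["git", "branch", "--track", name, ref])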

If I later remove the branch ruby-legacy from github and run re pull, it's going to detect that I have a local tracking branch that is pointing at something that doesn't exist anymore:

$ re pull spiderfetch
> Fetching spiderfetch
> Merging spiderfetch                                                                    
-> Stale local tracking branch ruby-legacy, remove? [yN]

Scaling beyond a single machine

Now, re helps you manage multiple repos, but it also helps you keep your repos synced across machines. .reconfig is a kind of spec for what you want your repo-hosting directory to contain, so you can just ship it to a different machine, run re pull, and it will clone all the repos over there, set up local tracking branches, all the same stuff.

In fact, why not keep .reconfig itself in a repo, which again you can push to a central location and from which you can pull onto all your machines:

$ re list                                                                                
[.:git]
    origin.url = user@host:~/repohost.git
[galleryforge:git]
    origin.url = git@github.com:numerodix/galleryforge.git
[italian-course:git]
    origin.url = git@github.com:numerodix/italian-course.git
[re:git]
    origin.url = git@github.com:numerodix/re.git
[spiderfetch:git]
    origin.url = git@github.com:numerodix/spiderfetch.git
> Run with -u to update .reconfig

It does not manage .gitignore, so you have to do that yourself.

Advanced uses

Those are the basics of re, but the thing to realize is that it doesn't limit you to a situation like the one we've seen in the examples so far, with a single directory that contains repos. You can have repos at any level of depth, you can have .reconfigs at different levels too, and you can then use a single re pull -r to recursively pull absolutely everything in one step.

Get it from github: https://github.com/numerodix/re

nametrans: renaming with search/replace

March 25th, 2011

Keeping filenames properly organized is a pain when all you have available for the job is renaming files one by one. It's most disheartening when there is something you have to do to all the files in the current directory. This is where a method of renaming by search and replace, just as in a text document, would help immensely. Something like this perhaps:

[screenshot: nametrans_ss]

Simple substitutions

The simplest use is just a straight search and replace. Every file in the current directory is checked to see if it matches the search string.

$ nametrans.py "apple" "orange"
 * I like apple.jpg    -> I like orange.jpg
 * pineapple.jpg       -> pineorange.jpg
 * The best apples.jpg -> The best oranges.jpg
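
At its core this is nothing more exotic than a regular expression substitution applied to each filename, followed by a rename. A minimal sketch of the idea (not nametrans's actual code, which also previews changes and handles the options below):

import os
import re

def rename_all(pattern, replacement, path="."):
    for name in sorted(os.listdir(path)):
        new_name = re.sub(pattern, replacement, name)
        if new_name != name:
            print(" * %s -> %s" % (name, new_name))
            os.rename(os.path.join(path, name),
                      os.path.join(path, new_name))

rename_all("apple", "orange")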

There are also a number of options that simplify common tasks. Options can be combined and the order in which they are set does not matter.

Ignore case

Matching against strings with different case is easy.

$ nametrans.py -i "pine" "wood"                                                        
 * pineapple.jpg -> woodapple.jpg
 * Pinetree.jpg  -> woodtree.jpg

Literal

The search string is actually a regular expression. If you use characters that have a special meaning in regular expressions then set the literal option and it will do a standard search and replace. (If you don't know what regular expressions are, just use this option always and you'll be fine.)

$ nametrans.py --lit "(1)" "1" 
 * funny picture (1).jpg -> funny picture 1.jpg

Root

If you prefer the spelling "oranje" instead of "orange" you can replace the G with a J. This will also match the extension ".jpg", however. So in a case like this set the root option to consider only the root of the filename for matching.

$ nametrans.py --root "g" "j"
 * I like orange.jpg    -> I like oranje.jpg
 * pineorange.jpg       -> pineoranje.jpg
 * The best oranges.jpg -> The best oranjes.jpg

Hygienic uses

Beyond specific transforms, there are some general options for keeping filenames consistent that apply to many scenarios.

Neat

The neat option tries to make filenames neater by capitalizing words and removing characters that are typically junk. It also does some simple sanity checks like removing spaces or underscores at the ends of the name.

$ nametrans.py --neat                                                                    
 * _funny___picture_(1).jpg -> Funny - Picture (1).jpg
 * i like apple.jpg         -> I Like Apple.jpg
 * i like peach.jpg         -> I Like Peach.jpg
 * pineapple.jpg            -> Pineapple.jpg
 * the best apples.jpg      -> The Best Apples.jpg

Lower

If you prefer lowercase, here is the option for you.

$ nametrans.py --lower
 * Funny - Picture (1).jpg -> funny - picture (1).jpg
 * I Like Apple.jpg        -> i like apple.jpg
 * I Like Peach.JPG        -> i like peach.jpg
 * Pineapple.jpg           -> pineapple.jpg
 * The Best Apples.jpg     -> the best apples.jpg

If you want the result of neat and then lowercase, just set them both. (If you like underscores instead of spaces, also set --under.)

Non-flat uses

Presuming the files are named consistently you can throw them into separate directories by changing some character into the path separator.

Note: On Windows, the path separator is \ and you may have to write it as "\\\\".

$ nametrans.py " - " "/"
 * france - nice - seaside.jpg -> france/nice/seaside.jpg
 * italy - rome.jpg            -> italy/rome.jpg

The inverse operation is to flatten the entire directory tree so that all the files are put in the current directory. The empty directories are removed.

$ nametrans.py --flatten
 * france/nice/seaside.jpg -> france - nice - seaside.jpg
 * italy/rome.jpg          -> italy - rome.jpg
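
The flatten operation can be pictured as the following sketch (an illustration only, not nametrans's code): walk the tree bottom up, join the path components with " - " to form the new name, move the file into the top directory, and remove directories once they are empty.

import os

def flatten(root="."):
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        rel = os.path.relpath(dirpath, root)
        for name in filenames:
            # "france/nice/seaside.jpg" -> "france - nice - seaside.jpg"
            flat = " - ".join(rel.split(os.sep) + [name])
            os.rename(os.path.join(dirpath, name), os.path.join(root, flat))
        if not os.listdir(dirpath):
            os.rmdir(dirpath)   # prune the directory once it is empty

flatten()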

In general, the recursive option will take all files found recursively and make them available for substitutions. It can be combined with other options to do the same thing recursively as would otherwise happen in a single directory.

$ nametrans.py -r --neat 
 * france/nice/seaside.jpg -> France/Nice/Seaside.jpg
 * italy/rome.jpg          -> Italy/Rome.jpg

In recursive mode the whole path will be matched against. You can make sure the matching only happens against the file part of the path with --files or only the directory part with --dirs.

Special uses

Directory name

Sometimes filenames carry no useful information and serve only to maintain them in a specific order. The typical case is pictures from your camera that have meaningless sequential names, often with gaps in the sequence where you have deleted some pictures that didn't turn out well. In this case you might want to just use the name of the directory to rename all the files sequentially.

$ nametrans.py -r --dirname                                                              
 * rome/DSC00001.jpg -> rome/rome 1.jpg
 * rome/DSC00007.jpg -> rome/rome 2.jpg
 * rome/DSC00037.jpg -> rome/rome 3.jpg
 * rome/DSC00039.jpg -> rome/rome 4.jpg

Rename sequentially

Still in the area of sequential names, at times the numbers have either too few leading zeros to be sorted correctly or too many unnecessary zeros. With this option you can specify how many leading zeros you want (and if you don't say how many, it will figure it out on its own). It is based on an old piece of code that has been integrated into nametrans.

$ nametrans.py -r --renseq 1:3                                                           
 * rome/1.jpg   -> rome/001.jpg
 * rome/7.jpg   -> rome/007.jpg
 * rome/14.jpg  -> rome/014.jpg
 * rome/18.jpg  -> rome/018.jpg
 * rome/123.jpg -> rome/123.jpg

The argument required here means field:width, so in a name like:

series14_angle3_shot045.jpg

the number 045 can be shortened to 45 with "3:2" (third field from the beginning) or "-1:2" (first field from the end).
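
The mechanics can be pictured like this: find every run of digits in the name, pick the field-th run (counting from 1, or from the end when negative), and re-pad it to the requested width. A sketch of that idea, not nametrans's actual code:

import re

def renseq(name, field, width):
    runs = list(re.finditer(r"\d+", name))
    run = runs[field - 1] if field > 0 else runs[field]
    number = str(int(run.group())).zfill(width)   # strip and re-pad zeros
    return name[:run.start()] + number + name[run.end():]

print(renseq("series14_angle3_shot045.jpg", 3, 2))    # series14_angle3_shot45.jpg
print(renseq("series14_angle3_shot045.jpg", -1, 2))   # series14_angle3_shot45.jpg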

Get it from sourceforge:

things I hate about haskell

September 4th, 2010

As the title for this blog entry popped into my head today I realized I had been silently writing this for five years. I want to stress one word in the title and that's the word "I". It might be that what I have to say is no more insightful than the complaints of people who don't like python because of the whitespace, or of people who can't get over the parens in lisp. But I happen to believe that a subjective point of view is valid, because someone, somewhere had a reaction to something, and it's crooked to pretend that "the system is immune to such faults".

My complaints have little to do with the big ideas in haskell, and those ideas can just as well be realized in another language. In fact, the way things are going, it's likely that haskell's bag of tricks will be the smörgåsbord for many a language to choose from. F# from Microsoft, clojure making waves, and lambdas reaching even java. As James Hague wrote, functional programming went mainstream years ago.

Elitism

I don't mean elitism in the sense that you sometimes hear about lisp mailing lists, that the people are hostile to newbies with a harsh rtfm culture prevailing. I haven't met any nasty haskell people; it's in the culture that the elitism is encoded.

I don't think I have to explain that if you, as a software engineer, meet with a client for whom you are building a product then you don't insist that the conversation be held in terms of concurrency primitives or state diagrams. If you want to get anything done, you have to speak the language of the customer. And that's a general principle: if you are the more expert party in the field that the conversation concerns, then it's your responsibility to bridge the gap. If you don't understand your customer, then you have a problem. And if you do understand him, but you don't care, then there's a word for that... oh yes, elitism.

What I mean by elitism in haskell is the belief that "what we have is so great that we're doing you a favor if we let you be a part of it." Trying to learn haskell is something like being a fly on the wall at a congress of high priests. There is theological jargon flying all around and if you happen to make out some of it, we'll let you live, that's how nice we are. There seems to be a tacit assumption in place that if you touch haskell then either a) you have been brought up in the faith or b) you have scrubbed your soul clean of the sins of your former life. That is to say if you're comfortable coding in lambda calculus then you'll have a smooth run, but if you code in any of the top 10 most widely used languages then good luck to you. "Enough already, do I have to spell it out for you: forget everything you know about programming and imprint these ideas on your brain."

Here, I'm pleased to mention Simon Peyton Jones who makes an effort to speak the language of his audience. And when you read his papers you get the impression that he's saying "okay, so maybe you're not a gray beard in this field, but I'll try to explain this so that you get something out of it". Alas, Simon is miles ahead of the pack.

Still, it's changing over time. There seems to be a new generation of haskell coders who don't live in that ivory tower and actually do sympathize with their fellow man. (To the mild chagrin of the high priests, one imagines.)  The first time I really knew that haskell was on the right track in the language ecosystem evolution was the appearance of Real World Haskell. It's the first book that I knew about that wasn't for the in-group, with a provocative title that seems to say "the language isn't about you crazy puritans in the programming language lab, it's about people trying to solve their everyday problems". Since then I have seen further evidence of this development, notably Learn You a Haskell, a "port" of Why's Poignant Guide to Ruby, the first book about Ruby that really took the newbie programmer seriously (and I'm pretty sure did wonders for Ruby adoption).

Math envy

One of the earliest things I ever read about haskell was "the neat thing about haskell is that functions looks almost like functions in math". I remember thinking "why is that supposed to be a selling point?" Unless you are actually implementing math in code (which is a pretty minuscule part of the programming domain), who cares? As I would discover later, it was a sign of things to come.

I read a couple of books recently about the craft of software engineering that are written in the form of interviews with famous coders. I recall that on several occasions there would be a passage more or less like this "but when I went back to look at the code I had written back then I was ashamed that I had used so many single character variable names". Exactly how many more articles and books have to get written until people stop writing code like this:

(>=>) :: Monad m => (a -> m b) -> (b -> m c) -> a -> m c
m >=> n = \x -> do { y <- m x; n y }

Is this an entry in a code obfuscation competition? (Is there some way you could obfuscate this further?) Why does reading code in haskell have to be some sort of ill conceived exercise in symbol analysis where you have to try to infer the meaning of a variable based on the position of the parameter? I'll be honest, I have no memory for these kinds of things, half way down the page I have long since forgotten what all the letters stand for. Why on earth wasn't it written like this?

(>=>) :: Monad tmon => (tx -> tmon ty) -> (ty -> tmon tz) -> tx -> tmon tz
f1 >=> f2 = \x -> do { y <- f1 x; f2 y }

Maybe that's not the optimal convention, but it's far better than the original. Almost anything is. (It's still pretty minimalistic by the standards of most languages.) Haskell prides itself on its type system. As a programmer you have that type information, so for goodness sake use it in your naming.

This kind of thing is normal in math: pick the next unused letter in the alphabet (or even better, the Greek alphabet!) and you have yourself a variable name. It's horrible there and it's horrible here.

Syntax optimized for papers

If you read about the history of haskell this is not really all that surprising. Before haskell there was a community of people doing research on topics in functional languages, but to do that they had to invent little demonstration languages all the time. So eventually they decided to work on a common language they could all use and that became haskell.

Haskell snippets look nice in papers no doubt. But how practical is this syntax?

main = do putStrLn "Who are you?"
          name <- getLine
          case M.lookup name errorsPerLine of
               Nothing -> putStrLn "I don't know you"
               Just n  -> do putStr "Errors per line: "
                             print n

Is your editor set up for haskell, does it know to indent this way? If not you're going to suffer. Haskell breaks with the ancient convention of using tabs for indentation, so if what follows a do or a case doesn't line up with a tab stop, you're out of luck. Unless your editor supports haskell specifically, that is. So all you people using some random editor: bite me.

Because of the offside rule, and haskell's obsession with pattern matching/case statements, all your code ends up in the bottom right of the screen in functions past a certain size.

And haskell is obsessed with non-letter operator names, because letters are so tiresome, right? Haskell people love to author papers in latex and generate pretty pdfs where all the operators have been turned into nice symbols (and Latin letters become Greek letters, more math envy). Sometimes when you're reading the paper you don't even know how to type it into an editor. It's almost like coding haskell is also an exercise in typesetting. "Yes, it's ascii, but just think how lovely it's gonna look in a document!"

ansicolor: because the view is better in colors

August 6th, 2010

If you're a coder you probably try to modularize everything to death on a daily basis. If not, your practices are a little suspicious. :nervous: Alas, it's not so easy to knock out something that I can say with confidence will be reusable in the future. One piece of functionality I keep reimplementing is output in colors, because it's hugely helpful to making things look more distinct. The first time I wrote this module I knew I would be using it again and I wished to make it nice and reusable, but I didn't know what the future uses would be. So I put that off until "later". In the meantime I copy/pasted it a couple of times into other projects. Shameful, but effective.

I finally got around to organizing these types of bits that have no specific place of their own into a new github repository, appropriately named "pybits". It holds the pretty printer and this rewritten ansicolor module, and it'll probably grow with the ages.

But to business. Anyone spitting out ansi escapes who has figured out the system knows it's trivial to make a color chart. So to keep the tradition going, here's proof that ansicolor is able to enumerate the colors:

[screenshot: ansicolor_chart]

Notice that section at the bottom about highlighting colors. As you might be able to deduce by sheer logic, black and white are not great colors for highlighting something in a terminal, because they are typically used respectively as the background and foreground of the term (or vice versa). (The colors of a term can actually be anything, but black and white are the common ones. Ideally, code should detect this at runtime, but I don't know of a way to check for this. Besides, lots of programs [eg. portage] do make this assumption also.) So the highlighting colors are supposed to be useful for when you want to output a wall of text and mark something in the middle of it, so the user can spot it.

Suppose you are (as I have been in the past) developing a regular expression and you can't get it right on the first try (yeah, unbelievable, I know). Well, what you do is highlight the string so you can see how the matching worked out:

[screenshot: ansicolor_1regex]
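
The effect is easy to approximate with raw escape codes; this is a generic sketch of the idea, not necessarily how ansicolor implements it: wrap every match of the regex in a color escape and reset afterwards.

import re

GREEN, RESET = "\x1b[32m", "\x1b[0m"

def highlight(pattern, text):
    # Wrap every match of the pattern in a green escape sequence.
    return re.sub(pattern, lambda m: GREEN + m.group() + RESET, text)

print(highlight(r"\d{4}-\d{2}-\d{2}", "released on 2010-08-06, updated later"))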

Regular expressions tend to get hairy (yes way) so it helps to compare their results when you're trying to unify two half-working variants into one. Adding a second regex will show the matches from both. Where they overlap the styling is bold:

[screenshot: ansicolor_2regex]

Think of the green highlighting as a layer of paint on the wall. You then paint a layer of yellow on top, but you don't cover exactly the same area. So where the green wasn't painted over it's still green. Where the yellow covered it, the paint is thicker. And where the yellow didn't overlap the green it's just plain yellow.

Adding a third regex potentially produces segments highlighted three layers thick, so there the color becomes reverse.

[screenshot: ansicolor_3regex]

And a fourth layer on top of that gives bold and reverse.

[screenshot: ansicolor_4regex]
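
The layering rule can be pictured as counting, for every character, how many match spans cover it and mapping that count onto a style: one layer is the plain color, two is bold, three is reverse, four is bold and reverse. A rough sketch of that mapping, using a single color for simplicity (again, not ansicolor's actual code):

import re

# Style by number of overlapping layers: plain, bold, reverse, bold+reverse.
STYLES = {1: "\x1b[32m", 2: "\x1b[1;32m", 3: "\x1b[7;32m", 4: "\x1b[1;7;32m"}
RESET = "\x1b[0m"

def highlight_layers(text, patterns):
    counts = [0] * len(text)
    for pattern in patterns:
        for m in re.finditer(pattern, text):
            for i in range(m.start(), m.end()):
                counts[i] += 1
    return "".join(ch if n == 0 else STYLES[min(n, 4)] + ch + RESET
                   for ch, n in zip(text, counts))

print(highlight_layers("abcdef", ["abcd", "bcde", "cdef"]))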

ansicolor doesn't support background colors, but that's a product of my use so far; I've never needed it. I don't think they improve readability.

You will find this cutting edge technology in the repo: