desktop hackery with grep

August 16th, 2007

Just as every self-respecting Unix user knows (and every Mac user should know, but probably doesn't), grep is the tool supreme for finding stuff in text files.

Here I describe harvest, a grep-like tool for searching through all kinds of files, text or otherwise.

Why on earth?
The original problem was rather contrived, admittedly. I had been resisting the facebook bandwagon for the longest time, but finally a friend talked me into trying it. If you know facebook, you know how the system works with adding "friends". And it's rather nice in how it imports your contacts from gmail and such. Of course, when you keep your contacts elsewhere, you're left with a chunk of manual labor, searching and adding one by one. Not that it's a big problem, just a one-time thing after all. But I was reluctant to undertake it, so I was clicking through facebook instead and found an import-contacts-from-file feature. Hm, now that sounds more like it.

So I thought to myself: in my university email account I have a year's worth of email history.. wouldn't it be satisfying to scan the whole thing and produce a list of email addresses I can import straight into facebook? Yes, I'm quite aware that my train of thought is somewhat off the beaten path much of the time.

In case you didn't know, email is just text. You may get a different idea when your email reader hides all the technical bits and just shows you the body of the message, but a stack of messages is just a bunch of text files, so there's no reason you can't treat them as any other text. So I went along and downloaded the whole thing, some 30mb. Now for the fun part.

So I need a tool that will scan the huge chunk of text and extract all the email addresses, and print them out in csv format. Preferably also remove duplicates from the list. Since I'm riding the ruby wave these days, it was the obvious choice, not least because I like how it handles regular expressions natively. So I hack up a script to do this, calling it harvest. It gets the data from the standard input, scans it for matches, and spits out the email addresses, very simple. And it works like a charm on my huge hunk of email data.
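To give an idea, that first version can be sketched in a few lines. This is a simplified model, not the actual script (which appears in full further down); the regex and the names here are illustrative stand-ins.

```ruby
# A minimal sketch of the first version: scan a blob of text for email
# addresses and return the unique matches. The regex is a simplified
# stand-in for the one in the real script.
EMAIL = /[a-zA-Z0-9_.\-]+@(?:[a-zA-Z0-9\-]+\.)+[a-zA-Z]{2,4}/

def harvest_emails(text)
  # String#scan returns every non-overlapping match of the pattern
  text.scan(EMAIL).uniq
end
```

Hooked up to standard input with `puts harvest_emails(STDIN.read)`, that is essentially the whole thing.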

At this point you'll be wondering: why on earth not use grep? Because to my knowledge grep only matches line by line, whereas I wanted something more general than that. And of course, once you actually code it up yourself, you have all the freedom you could ask for, rather than being limited to what grep does and doesn't do.

Can you make this thing go any faster?
Later on I realized that I can run harvest on just about any file, and it will still work. Not that I had discovered a new continent; strings already extracts all text strings from any file, including binaries. But the difference was that I could search for things. So I found a nice test subject: pagefile.sys, which is Windows's swap file. I boot Windows once every few months, and when I do I rarely remember what I was doing last time. But apparently I had decided at some point that the swap file should be 1.5gb.

So I run harvest on it, while keeping an eye on things in htop. And ouch, harvest is consuming the entire file, reading it all into memory. Next it's going to search for email addresses in a 1.5gb long string. Needless to say, that wasn't a success; the system started choking as it ran out of memory.

So I thought it would be better to buffer the file and read a chunk at a time. The only question is: how do I still match strings in a file when only one chunk of it is in memory at a time? I wasn't exactly planning on matching super long strings, but there is still the case where a string you want to find is partly in the chunk currently in memory and partly in the next one, so how do you make sure you don't miss it? I tried an algorithm, and it was no good. It turns out I was barking up the wrong tree for a long time, and as a result I rewrote it about five times until I got it right. It is uncanny how the best solution is usually also the simplest.

To make sure it runs at an acceptable speed, I also experimented with the buffer size vis-à-vis speed and memory use, finding that a small buffer is actually better. When hunting for performance problems, it's often a good idea to run your app through a profiler just to be sure it does what you think it does.

With a 10kb buffer (79s):

%   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 10.43     8.27      8.27   156723     0.05     0.05  Regexp#match
  8.73    15.19      6.92   618095     0.01     0.01  String#length
  6.87    20.64      5.45   153806     0.04     0.04  IO#read
  4.45    24.17      3.53   153810     0.02     0.02  String#+

This surprised me. I thought it would be spending far more time matching than, say, reading from disk. So I tried a bigger buffer to see if I could marginalize disk I/O in the overall cost.

With a 10mb buffer (115s):

%   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 94.46   108.96    108.96     3057    35.64    35.64  Regexp#match
  2.52   111.87      2.91      152    19.14    19.14  IO#read
  1.31   113.38      1.51      156     9.68     9.68  String#+
  0.28   113.70      0.32       75     4.27    33.47  Kernel.require

This is more like what I expected: now almost all the time is spent matching. But it actually takes longer (and obviously uses more memory as well), so there's nothing to gain by increasing the buffer unless the string we're searching for is so long that we need a buffer of megabytes. (Emails and urls are obviously much shorter than that.)

To profile your script in ruby, try:

ruby -rprofile ninja.rb

The script now runs pretty fast, scanning the Windows pagefile in a couple of minutes, which I'm quite satisfied with.

More fun than a bag of chips, but useful?
I'm sure you're still wondering if the facebook scheme was a success. It wasn't. It turns out that out of all the emails harvested, a single one was found on facebook. As popular as the site is in Norway, apparently it doesn't have any Dutch users.

But since I already had harvest, I thought I would add an option to find urls as well, just for the heck of it. For instance, if you visited some cool site and forgot to bookmark it, it's probably still in Firefox's history file, so you can do:

harvest.rb --dat < ~/.mozilla/firefox/<profile>/history.dat

And not just files, either. To take a rather unexpected use case.. say you had an important email address, like for a job interview at the chocolate tasting lab, and you lost it.. well, maybe it was swapped out at some point. Use harvest to scan your swap partition for email addresses:

cat /dev/hdXY | harvest.rb --email

And you can actually run that on any filesystem. I don't know how to access live memory in the same way, but that would be fun to try too. Things like zip files won't work, of course, because the text is compressed, but otherwise (most of the time) you can read text out of any file, whether it's a text file or not.

So is it actually useful? Not really.

But the useful observation is that your data is right there, and though you may not see it directly, it doesn't take more than this to actually look through it.

#!/usr/bin/env ruby
#
# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU Public License, version 3.
#
# revision 3 - allow spaces in urls
# revision 2 - introduce buffering to handle large files out of memory
# revision 1 - performance hacking: output entries immediately, only sort on
# emailcsv


require "optparse"


email = /([a-zA-Z0-9_\.-])+@(([a-zA-Z0-9-])+\.)+([a-zA-Z0-9]{2,4})+/m
url_orig = /([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9\/](([A-Za-z0-9$_.+!*,;\/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;\/?:@&~=%-]{0,1000}))?)/m
url = /([A-Za-z][A-Za-z0-9+.-]{1,120}:\/\/(([A-Za-z0-9$_.+!*,;\/?:@&~(){}\[\]=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9 $_.+!*,;\/?:@&~(){}\[\]=%-]{0,1000}))?)/m

# defaults
pattern = url
joinlines = false
emailcsv = false
buffer_size = 10*1024  # read the input in 10kb chunks
hardlimit = 100        # upper bound on the carry-over between chunks


## parse options
OptionParser.new do |opts|
	opts.on("--url", "url format") do |v|
		pattern = url
	end
	opts.on("--dat", "firefox history.dat format = \\\\n in urls") do |v|
		joinlines = true
	end
	opts.on("--email", "email format") do |v|
		pattern = email
	end
	opts.on("--emailcsv", "csv output (facebook contact import)") do |v|
		pattern = email
		emailcsv = true
	end
end.parse!


entries = []
previous = ""
# read the input a chunk at a time; prepend the carry-over from the
# previous round so matches straddling a chunk boundary are not lost
while string = previous + STDIN.read(buffer_size).to_s and string.length > previous.length do
	partial = ""
	joinlines and string.gsub!(/\\\n/, "")
	while string and m = pattern.match(string) and m.size > 0 do
		# a match that runs right up to the end of the buffer may
		# continue in the next chunk, so hold it back instead of
		# emitting it now
		m.end(0) == string.length and partial = m.to_s
		if partial.empty?
			if emailcsv
				entries << m.to_s
			else
				puts m.to_s
			end
		end
		pos = m.end(0)
		string = string[pos..-1]
	end
	# carry over either the held-back partial match or a bounded
	# tail of the unmatched remainder
	if !partial.empty?
		previous = partial
	else
		if hardlimit < string.length
			previous = string[string.length-hardlimit..-1]
		else
			previous = string
		end
	end
end

# a match ending flush with the end of input was held back above and
# would otherwise be lost; emit it now
if !previous.empty? and m = pattern.match(previous)
	if emailcsv
		entries << m.to_s
	else
		puts m.to_s
	end
end

# special stuff for csv email output
if !entries.empty?
	entries = entries.sort{ |a, b| a.downcase <=> b.downcase }.uniq
	puts '"Email Address","Formatted Name"'
	entries.each { |i| puts '"' + i + '",""' }
end


3 Responses to "desktop hackery with grep"

  1. blandest says:

    I think you can scan live memory by using /dev/mem

    cat /dev/mem | harvest.rb --email

  2. [...] relax, perhaps all is not lost. A couple of weeks ago I went over how you can find stuff on disk by searching the raw data. The same *can* be done with memory. See, just because my message is gone and gmail doesn’t [...]

  3. [...] I don’t have a great deal of Ruby code at my disposal, but this should do the trick. How does scanning the raw filesystem for urls sound? The old harvest script actually does a half decent job of turning up a bunch of [...]