Archive for August 17th, 2006

Project Newman :: An introduction

Thursday, August 17th, 2006

I have been posting on Xtratime.org (a football forum) since sometime in 2000. The site has been through a lot in that time, but one thing that hasn’t changed is a member called Carson35 posting news stories from various football news sites with astonishing regularity. He now has 74k+ posts, far more than anyone else, and most of those are copy/paste jobs of news stories. Over the years he’s become a celebrity for his undaunted commitment to bringing the news, and he has been decorated with a special title – XT Post Number King. Some have jokingly suggested that he’s a robot, programmed to do this one thing.

So I thought it would be fun to try and imitate Carson, as a tribute if you will. And, of course, I mean computationally, in an automated manner. The purpose of such a thing would be to satisfy my curiosity in certain areas:

  • how hard would it be to imitate Carson35 by posting news articles?
  • how closely could I reproduce his activity?
  • what are the biggest challenges in making this work without any user input?
  • just how automated could it be made?
  • could I build a bot that would be accepted by other members (or at least not hated for spamming)?

The project was first dubbed Carson36, as an increment of the Carson we all know. But then Erik suggested Newman – for a bot that brings the news – and I couldn’t resist that name. :D

newman.jpg

While this is a technical topic, I’ll try to do something I’m not good at – explain it in simple terms. That’s what good technical writers do, and it would be nice to imitate.

This entry is part of the series Project Newman.

charset wars

Thursday, August 17th, 2006

Have you ever opened a web page and all you could see was garbled text? That was a charset conflict. The page had been written in a charset other than the one in which it was displayed to you. If you look at this page, the Norwegian characters should display correctly, but if you do this:

charset.png

(i.e. change the charset manually), then non-ascii characters will be garbled. Why? Because the file was written as utf-8 text, but is being read in iso-8859-1 encoding. The function that reads the text has no way of knowing this, so it translates every byte using the wrong table. utf-8 stores ascii characters in one byte, but everything else takes two or more bytes (two for the Norwegian æ, ø and å), whereas iso-8859-1 always uses exactly one byte per character. So each of those characters is ‘mis-translated’ and shows up as two characters instead of one.
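To make this concrete, here is a minimal Python sketch of the same mismatch. It is only an illustration: the function name `misread` and the sample text are my own, not something taken from the page in the screenshot.

```python
def misread(text: str) -> str:
    """Write text as utf-8, then read the bytes back as iso-8859-1."""
    raw = text.encode("utf-8")        # the bytes as they sit in the file
    return raw.decode("iso-8859-1")   # the reader assumes the wrong charset


sample = "æøå"                        # three Norwegian characters
garbled = misread(sample)

print(len(sample), len(garbled))      # 3 6
print(garbled)                        # Ã¦Ã¸Ã¥
print(misread("plain ascii"))         # plain ascii (ascii is identical in both)
```

Each Norwegian character is stored as two bytes, and each of those bytes is then shown as a character of its own, which is why the garbled version is twice as long.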

This is usually not a problem, because most websites (and most half-conscious web coders) have the decency to declare the charset in the head of the page, like so:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
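As a rough sketch of how a program could read that declaration, the following Python uses the standard library's `html.parser` to pull the charset out of a meta tag. The names `CharsetFinder` and `declared_charset` are hypothetical, and this is a simplification: real browsers also look at the HTTP headers and at byte order marks, which this sketch ignores.

```python
import re
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Long form: the charset is a parameter inside the content attribute
            content = attrs.get("content") or ""
            match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content, re.I)
            if match:
                self.charset = match.group(1).lower()


def declared_charset(html: str):
    """Return the declared charset in lowercase, or None if there is none."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset


tag = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
print(declared_charset(tag))          # utf-8
```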

So much for the web. What’s worse is when you get charset conflicts in terminals. Most modern linux distros now ship in full utf8 mode, that is, applications are set to use utf8 by default to avoid all these problems. But then I log into a server and use nano or vim (or, if need be, emacs) to edit files, and I get in trouble. The text I type is in utf8, because my terminal controls what bytes are sent to the server. The server will most likely not expect that (some of these server distributions are ancient and do *not* use utf8 by default), so when I type non-ascii characters in nano and save the file, the text gets garbled.

vim supports utf8, so the problem is much reduced there. But in nano, I basically have to save, then open the file again to see where the bugs are. This has to do with how text is handled: an editor that isn’t aware of utf8 counts every byte as one character, left to right. So if I type a non-ascii utf8 character (which takes two or more bytes) and then try to erase it, nano will erase just one byte. For a two-byte character, “half” the character is still there. And so on and so forth. Very annoying, I tell you.
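The “half a character” problem is easy to reproduce outside of nano. The Python sketch below only imitates what a byte-oriented editor does when you press backspace; the function names are my own, and it makes no claim about how nano is implemented internally.

```python
def erase_last_byte(text: str) -> bytes:
    """What a byte-oriented editor does: drop one byte from the end."""
    return text.encode("utf-8")[:-1]


def erase_last_char(text: str) -> bytes:
    """What a utf8-aware editor does: drop one whole character."""
    return text[:-1].encode("utf-8")


word = "blå"                           # the last character takes two bytes in utf8

print(erase_last_char(word))           # b'bl'
print(erase_last_byte(word))           # b'bl\xc3'  (half of the å is left behind)

leftover = erase_last_byte(word)
try:
    leftover.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid utf8 any more:", err.reason)

# A lenient reader shows a replacement character for the leftover byte.
print(ascii(leftover.decode("utf-8", errors="replace")))   # 'bl\ufffd'
```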

So why bother with utf8? Because utf8 (and unicode in general) was designed to solve all these charset conflicts. ISO 8859 is a legacy standard, and with its various parts (iso-8859-1 for Western European languages, iso-8859-5 for Cyrillic, and so on) it supports many different languages. But you can only use one part at a time, so if you write text in French in one file, you cannot also use Russian text in there; no single charset in the family supports both. Enter utf8, which supports pretty much _everything_. But as long as we still have piles of legacy systems that aren’t designed to handle utf8 (or at least don’t use utf8 by default), we will continue to experience these problems. Standards are only salvation insofar as they are applied. Correctly, consistently and universally. That much we have already learnt from IE vs the world in terms of web page rendering.
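The French and Russian example can be checked directly. This is a small Python sketch with sample words of my own choosing; `can_encode` is a hypothetical helper, and the codec names are the ones Python's standard library uses for these charsets.

```python
def can_encode(text: str, charset: str) -> bool:
    """True if every character in text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


french = "café"
russian = "привет"
mixed = french + " " + russian

for charset in ("iso-8859-1", "iso-8859-5", "utf-8"):
    print(charset,
          can_encode(french, charset),
          can_encode(russian, charset),
          can_encode(mixed, charset))

# iso-8859-1 True False False
# iso-8859-5 False True False
# utf-8 True True True
```

Only utf8 can hold the mixed text, which is the whole argument for using it everywhere.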