Project Newman :: An evaluation

August 29th, 2006

The thing about a project like Newman is that it's basically impossible to make it work perfectly. It has a difficult job, because there are so many potential sources of error. Servers may go offline, connections may fail, article formats may change and so on. It is as good as impossible to guarantee that Newman will do the right thing, because at the end of the day we are trying to analyze text, and computers are not good at doing that. Just look at spam filters - they have been improved upon for years, but everyone is still getting spam. Much less than before, of course, so the filters are definitely useful. And Newman too makes mistakes, but it does still succeed quite often.

Newman has been posting on Xtratime.org under the username Carsonne, a French female impersonator of Carson35's, it would seem. Carsonne has averaged about 15 posts a day since July 30, which comes to a little over 350 posts in all: 350+ news stories posted. While I haven't been keeping score to present statistical numbers, I have kept a close eye on Carsonne and I would estimate that upwards of 90% of the stories posted were correctly parsed, formatted and classified. In fact, I recall about 10-15 misposts among the ones I've seen (which I think is most of them). That is an error rate no human poster would have. At an estimated 95% success rate, Carsonne's error rate of about 5% is at least an order of magnitude worse than a human poster's (ie. I would claim that a human poster would have a >99.5% success rate at copy/pasting and classifying stories, which means fewer than 2 misposts in 350).

What about user input, then? Well, unfortunately Newman does present a certain configuration cost; not everything can be automated. In particular, finding channels is something that would be wonderful to automate, given how quickly the forum climate changes. Newman also requires that sources be configured (and, if need be, updated) for the parsing to work. Of course, once that is in place, Newman can post at will. So what Newman can do entirely on its own is still quite limited.

The screenshot below shows a typical run of Newman. Quite a few stories were fetched, some were selected for posting, and then posted. It also shows how Newman is fault tolerant - a parsing error was handled gracefully, as was a timeout from the forum web server.

newman_running.png

After 20+ days on the forum, Carsonne has been active long enough to stir up some reactions about "her" posting of news. Carson's long tenure has paved the way for posters like this, so Carsonne is seen by most as just another compulsive news poster. "She" has taken some heat over posting news in the wrong place (wrong classification), but beyond that it has been no worse than Carson gets daily.

So what have we learnt?

As is often the case, Project Newman seems to have raised more questions than it has answered. Sure enough, it isn't too hard to automate posting on a forum, it isn't too hard to fetch stories from the web and parse them, and it certainly isn't hard to automate this at a pace no human could keep up with. But it is hard to decide what text means, it is hard to decide which story is relevant to what thread, it is hard to decide whether a word in a sentence is a name and so on.

The question is just how to do these things in a reliable way.

Thus endeth Project Newman. Download the code from the code page if you're interested.

This entry is part of the series Project Newman.

Project Newman :: Further ideas not implemented

August 27th, 2006

Newman was meant to be a simple design that wouldn't take too long to build (it took me about 2 weeks of afternoons to write) and just focus on the issues that are simple to handle without making a complicated mess of it. There are all kinds of ways in which it could be improved and I'll mention some of the ideas that I decided to leave out.


Channel limits

Even in threads where it is deemed acceptable to post news articles, it isn't civil to post 20 articles a day in one thread. To overcome this problem, one could limit the number of stories that may be posted in one channel per day. This would require another cache, to keep track of how many stories have been posted in which channel already today.

This is something I thought I would do, but I haven't written it, because it doesn't seem to be a problem. Newman can post up to 5 stories in one thread (which is a bit much), but it will only do this in a couple of channels (the Real Madrid one tends to see a lot of news). So the amount of news posted in a channel reflects the amount of news about that club that day (and that seems quite fair).
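
For the record, the extra cache needn't be much. Here is a minimal sketch in Python of what I have in mind. It is not part of Newman, and the names (ChannelLimiter, max_per_day) and the in-memory dict are assumptions made purely for illustration; a real version would have to persist the counts between runs.

```python
from datetime import date


class ChannelLimiter:
    """Counts posts per channel per day (hypothetical helper, not Newman code)."""

    def __init__(self, max_per_day=5):
        self.max_per_day = max_per_day
        self._counts = {}  # (channel, day) -> number of stories posted

    def may_post(self, channel, today=None):
        """True while the channel is still under its daily cap."""
        today = today or date.today()
        return self._counts.get((channel, today), 0) < self.max_per_day

    def record_post(self, channel, today=None):
        """Remember that one more story went into the channel today."""
        today = today or date.today()
        key = (channel, today)
        self._counts[key] = self._counts.get(key, 0) + 1
```

The publisher would then ask may_post() before each post and call record_post() after a successful one. Because the day is part of the key, the count starts from zero again every morning without any explicit reset.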

A bandwidth monitor

Newman deals in receiving and sending data to web servers. This generates a fair bit of traffic. Web hosts tend to be somewhat sensitive about one client making lots of connections, because bandwidth isn't free. While Newman does not operate on a mass level and will not overload any server by itself, it may still be useful to know just how much bandwidth it generates.

I haven't looked deeply into this, so I'm not sure how to do it. Web servers send a Content-Length field in HTTP headers, but this will only account for the traffic received. Perhaps a byte count of the URL-encoded form input that Newman sends could be used to track outgoing traffic. But even this would not account for the low-level overhead in establishing socket connections.
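
As a rough idea of what such a monitor could look like, here is a small Python sketch. It is only an illustration under a few assumptions: the class and method names are made up, response headers are taken to be a plain dict, and form input is assumed to go out URL-encoded (which is what browsers send for ordinary HTML forms). It counts bodies only, so header bytes and socket overhead are still missing from the totals.

```python
from urllib.parse import urlencode


class BandwidthCounter:
    """Rough byte counter for bodies received and form data sent (hypothetical)."""

    def __init__(self):
        self.received = 0
        self.sent = 0

    def add_response(self, headers, body):
        """Count a response, preferring Content-Length when the server sent one."""
        length = headers.get("Content-Length")
        # Some responses (chunked ones, for instance) carry no Content-Length,
        # so fall back to the size of the body we actually read.
        self.received += int(length) if length is not None else len(body)

    def add_form_post(self, fields):
        """Encode the form fields, count the bytes, and return the encoded body."""
        encoded = urlencode(fields).encode("ascii")
        self.sent += len(encoded)
        return encoded
```

Returning the encoded body from add_form_post() means the same bytes that were counted are the ones that get sent, so the count can't drift from the real payload.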

User agent scrambling

Newman identifies itself as Firefox, so it looks like just another human client connecting to the web server. But the reporter retrieves a list of stories and then sequentially retrieves every story. This behavior is too systematic to be human (no one is interested in *every* story), and so a web admin who keeps track of logs, and who sees that the user agent is always the same, will assume it's the same client making these connections.

To escape that kind of detection, we could easily scramble the user agent string and just make Newman report a different user agent for every connection. This would make it look like the connections are coming from a shared ip address, but from different clients (for instance different people in a school or company).

Client source scrambling

Closely related to the point above, a web admin that wants to block Newman can only do so by blocking the IP address the connections are coming from. This could be overcome by making Newman connect through anonymous proxies (or maybe use the Tor network?), and connect through a different proxy every time - that would make blocking IP addresses a lot more difficult.

Running stories through online translator

This is a very silly idea, but sometimes people post stories from a non-English source, because the story talks about something that the English language sources haven't caught up with yet. Since Xtratime.org is an English speaking forum, most people don't understand these articles. So the person who posts it will sometimes post an automatically generated translation of the text, using the notoriously bad Altavista Babelfish.

And so it could be an extra feature for Newman to retrieve articles from gazzetta.it or Marca, translate them with Babelfish, and post them. This would further reduce the quality of the articles Newman is posting, but it certainly is something that human posters tend to do.

Project Newman :: The scheduler

August 26th, 2006

Now that we've covered the reporter, the editor and the publisher, we have a functional Newman that can actually post stories. I set up Newman as a cron job (ie. at set intervals) to run every three hours, but then it occurred to me that it isn't human behavior to post at 9am, then at 12pm, then at 3pm and so on. It just doesn't look real. And if someone were to keep an eye on Newman, they might notice that it always posts at regular intervals, which looks odd. (The point here isn't so much to fool people into believing that Newman is real, it is just to make it exhibit a lot of human qualities.)

So I thought why not add a scheduler to decide when Newman should run. The scheduler runs as a daemon (ie. an application that runs 24/7 in the background, but only does actual work whenever it is called upon). So the scheduler is given a time interval (for instance: 3 hours), and then it generates a random number between 0 and 3 hours. That's when Newman is going to run. And then it goes to sleep until that time. So if I start the scheduler at 10am, give it an interval of three hours, it may decide that Newman should run at 11.45. So then it goes to sleep until 11.45 and then it runs Newman.

newman_scheduler.png

The advantage of this method is also that if the scheduler runs Newman and Newman crashes, it won't make the scheduler crash. So the scheduler will still keep running and will again run Newman at the next interval. I've also made sure that the scheduler waits for Newman to finish, so that if Newman is taking a lot of time to complete and the next interval is in 5 minutes, Newman will not be started again until the current execution is finished.
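
To make the idea concrete, here is a rough sketch of such a scheduler loop in Python. It is not the actual Newman scheduler: the function names are made up, and I'm assuming each interval is treated as a window in which one random start time is picked, with the rest of the window slept off once Newman has finished.

```python
import random
import subprocess
import time


def pick_delay(interval_seconds, rng=random):
    """Pick a random start time (in seconds) somewhere inside the interval."""
    return rng.uniform(0, interval_seconds)


def run_once(command):
    """Run the command as a child process and wait until it has finished.

    A crash in the child only shows up as a non-zero return code here,
    so it cannot take the scheduler down with it.
    """
    try:
        return subprocess.run(command).returncode
    except OSError as exc:  # e.g. the program could not be started at all
        print(f"could not start {command!r}: {exc}")
        return None


def schedule_forever(command, interval_seconds):
    """Run the command once per interval, at a random moment in each interval."""
    while True:
        window_start = time.monotonic()
        time.sleep(pick_delay(interval_seconds))
        run_once(command)  # blocks until done, so two runs can never overlap
        # Sleep off whatever is left of this window (nothing if the run overran).
        remaining = window_start + interval_seconds - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
```

Something like schedule_forever(["python", "newman.py"], 3 * 3600) would then give one run at a random moment in every three-hour window; the script name here is a placeholder. Running Newman as a separate process is what keeps its crashes from reaching the scheduler.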

Project Newman :: The publisher

August 25th, 2006

Compared to what we've talked about so far, the publisher is a pretty simple piece of the puzzle. It receives a list of stories, each one assigned to one or more channels, and simply posts them on the selected target, that is Xtratime.org. For this to work, we must first prepare an account on the forum for Newman. Having done that, the publisher will log in the user, open the thread where the story should be posted and simply post it, adding some vBcode formatting to the text. The image below shows what a typical news post looks like.

newman_publisher.png

While Newman posts articles on Xtratime.org, which is a vBulletin forum, it could just as easily post them on any other website. It's just a matter of reading the HTML code sent to us from the server and submitting HTML forms with the correct data. I won't bore you with the details. Of course, there is always a chance that the server may drop the connection while the posting is in progress, or the connection may fail in the first place. In these cases, the publisher will report the error, but it will try to post the next story as if nothing happened.

In the event that vBulletin rejects the post for any reason, Newman tries to read this error and report it. It may be that two stories have been posted too close together (vBulletin has a limit for how often posts can occur by the same user), or perhaps something else didn't go to plan. In any event, it handles these errors gracefully without crashing.
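
The "report it and move on" behavior boils down to a loop like the one below. This is a simplified Python sketch, not the publisher's real code: post_story stands in for whatever function actually logs in and submits the form, and I'm assuming it raises an exception when the connection fails or the forum rejects the post.

```python
def publish_all(stories, post_story):
    """Try to post every story; report failures but keep going.

    post_story is a hypothetical callable that raises on any failure.
    Returns the list of (story, error) pairs that could not be posted.
    """
    failures = []
    for story in stories:
        try:
            post_story(story)
        except Exception as exc:  # broad on purpose: one bad post must not stop the run
            print(f"failed to post {story!r}: {exc}")
            failures.append((story, exc))
    return failures
```

Collecting the failures rather than just printing them leaves the door open to retrying those stories on the next run.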

In the early stages of development, I was testing Newman on a test forum I set up, which was protected with a password. Newman can use basic HTTP authentication to access web sites in that way.

Project Newman :: The editor

August 22nd, 2006

The editor is basically the "brain" of Newman. It's the most complicated part, because it has to handle the most logic. Broadly speaking, it is the editor's job to figure out which news articles to post where. Once the target is set (ie. Xtratime.org), the editor has to figure out whether each of the articles delivered by the reporter should be published in any of the channels (ie. threads) we have available. The illustration below shows the editor's role in the chain.

newman_schematic.png

Finding channels

But before we dive right into it, a small note on channels is in order. Let's start with a rather more basic question: How does a human post news articles? Carson35 will post articles wherever the particular story is relevant - either to the thread at large, or to the last few posts specifically. The question is whether Newman can imitate this behavior. Xtratime.org is divided into lots of forums, one for each club, where threads about that one club are found. Some of these forums have special threads active all through the season, like for instance a "transfer rumours" thread. So what should Newman do to decide where to post an article? It could iterate over the threads in a certain forum to figure out "what this thread is about". But that is rather difficult to do, for a bot. Given just one sentence, how do you mechanically establish such a terribly human observation - what it's "about"? If Newman could do this, it would be quite clever. But, I must admit that I can't think of a method.

So, the approach I took was to input a list of channels manually selected. I can't think of a way to establish what a post is about, or a thread is about, or if a certain story should be posted in a specific thread. So I had to fall back on a human method and simply give Newman a list of threads it can use to post articles in.

The subject filter

So I've built Newman to do all the tedious work for me, but I've already had to produce the channels myself, it would be nice if Newman could do some work too now. Given the channels, we now have a bunch of stories and a bunch of channels - how do we match them? I've selected my channels so that I have one channel per club forum. One thread to post news articles in is already enough to aggravate sensitive forum people, so I'm not going to push my luck. This also means that I have to figure out if a certain story is about a certain club, or not. This I call the subject filter (ie. to establish the subject of the story).

If you think this is already getting hazy, unfortunately it doesn't get any better. I'm not at all interested in trying to deduce the meaning of sentences in English (this would likely take me around forever to finish). Instead, I'm limiting myself to just looking at individual words. So while a complete analysis would reveal that the phrase "the royal club" may be talking about Real Madrid, I won't be getting into that. I will limit myself to looking for just words. Now, it may seem prudent that in order to establish that a story talks about a certain club, it would be helpful to look for the names of players who play for that club. But players change clubs all the time, so the list of players for every club would have to be updated every so often (and remember: we're trying to minimize the human input here). Worse still, half the stories in the papers about Real Madrid discuss possible signings of players who belong to other clubs. To consider adding all players linked with a club to the list of players at every club would be mad.

So, the only thing I will use is the one name that doesn't ever change: the name of the club itself. In its many incarnations. So a story that mentions "Real Madrid" is one that we probably want to classify as eligible for the Real Madrid channel. But it could also just mention Real or Madrid on their own, so we have to consider that too. But then again, Real could also refer to Real Zaragoza, so then "Real" should be a weaker match than "Real Madrid". As you can probably see by now, this is going in the direction of a spam filter: searching for words and scoring them according to certain rules. In addition, it struck me that the position of a word in a text tends to mean something (if it's in the beginning of the story, or in the title, it should give a higher score). Finally, the length of a story matters as well. In a typical story of the kind we like to analyze, the name of a club may appear 2 to 5 times. In a very short story, it may only appear once. In a very long story, it may appear more times, in the guise of nicknames and phrases like "the royal club". So a long story may mention Real Madrid 3 times, but may actually be about Barcelona, so we will not give "Real Madrid" as high a score as it would get in a shorter story.
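
To give a feel for how these rules combine, here is a small scoring sketch in Python. It illustrates the idea rather than Newman's actual filter: the alias weights, the title and lead bonuses, and the "typical length" of 300 words are all numbers I've made up for the example. Matching the longest alias first and blanking it out, so that the "Real" inside "Real Madrid" isn't counted twice, is likewise just one possible choice.

```python
import re

# Hypothetical alias weights: the full name is a strong match, its parts weaker.
ALIASES = {
    "real-madrid": {"Real Madrid": 3.0, "Madrid": 1.5, "Real": 0.5},
    "real-zaragoza": {"Real Zaragoza": 3.0, "Zaragoza": 1.5, "Real": 0.5},
}


def subject_score(title, body, aliases, title_bonus=2.0, lead_bonus=1.5,
                  lead_chars=200, typical_words=300):
    """Score how strongly a story seems to be about one club."""
    score = 0.0
    for text, in_title in ((title, True), (body, False)):
        remaining = text
        # Longest aliases first, so "Real Madrid" wins over "Real" and "Madrid".
        for alias in sorted(aliases, key=len, reverse=True):
            pattern = re.compile(r"\b" + re.escape(alias) + r"\b")
            for match in pattern.finditer(remaining):
                weight = aliases[alias]
                if in_title:
                    weight *= title_bonus   # mentions in the title count more
                elif match.start() < lead_chars:
                    weight *= lead_bonus    # so do mentions near the start
                score += weight
            # Blank out what was matched (same length, so positions stay valid).
            remaining = pattern.sub(lambda m: " " * len(m.group()), remaining)
    # Damp the score for stories longer than a "typical" one.
    words = max(len(body.split()), 1)
    if words > typical_words:
        score *= typical_words / words
    return score
```

A story would be scored once against every club's alias table, and it becomes eligible for a channel when its score clears a threshold, which, as with a spam filter, has to be tuned by trial and error.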

Getting a bit hazy, is it? I thought it might. The subject filter works fairly well in most cases. There have been occasional whoopsies, like a story about Arsenal de Sarandí posted in the Arsenal (the English one) forum. And there was a story about Luis Valencia matching the Valencia channel. I have tried to filter this by searching for Valencia as part of a name (ie. as part of a sequence of capitalized words) - and penalizing that match on suspicion of being the name of a person - but this kind of thing is very imprecise.
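
The capitalized-sequence check can be sketched in a few lines of Python. Again, this is an illustration and not the real filter: the function names and the penalty factor of 0.2 are assumptions. It also shows why the approach is imprecise, since a capitalized word that merely happens to sit next to the club name ("Yesterday Valencia won") triggers the penalty too.

```python
import re

# A capitalized word directly before or after the match, separated only by spaces.
CAP_WORD_BEFORE = re.compile(r"\b[A-Z][\w'-]*\s+$")
CAP_WORD_AFTER = re.compile(r"^\s+[A-Z][\w'-]*")


def is_part_of_name(text, start, end):
    """True if text[start:end] sits inside a run of capitalized words."""
    return bool(CAP_WORD_BEFORE.search(text[:start])
                or CAP_WORD_AFTER.match(text[end:]))


def penalized_weight(text, alias, weight, penalty=0.2):
    """Sum the weight over all matches, scaling down those that look like names."""
    total = 0.0
    for match in re.finditer(r"\b" + re.escape(alias) + r"\b", text):
        if is_part_of_name(text, match.start(), match.end()):
            total += weight * penalty  # probably a person, e.g. "Luis Valencia"
        else:
            total += weight
    return total
```

Punctuation breaks the run, so a club name at the start of a new sentence is not penalized just because the previous sentence ended with a capitalized word.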

The topic filter

So far so good (do I sense hesitation?). For some channels it is enough to use the subject filter. But for others, those which have to do with transfer rumours, we should also decide whether a certain story is about possible transfers. (Incidentally, most soccer news is.) For this I created a separate filter, so that a story matching on "Real Madrid" would then have to pass through the topic filter to see if it seems to be transfer news. For this I used a word list - a list of words that are highly relevant to transfers, such as contract and offer. Then I scored these words just like I did with the subject filter and set the threshold after some trial and error to filter stories fairly reliably. In any case, I would rather wrongly filter out more stories than post irrelevant news (just like a spam filter would rather allow more spam than risk losing your non-spam email). This way, some transfer news didn't make the cut, but after all there are enough stories published every day to suffice.
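
In code, a word-list filter of this kind is very little. The Python sketch below is only an illustration: "contract" and "offer" come from my list, but the other words, all the weights and the threshold of 4.0 are invented for the example.

```python
import re

# Hypothetical word weights; only "contract" and "offer" are from the real list.
TRANSFER_WORDS = {
    "transfer": 3.0,
    "contract": 2.0,
    "offer": 2.0,
    "bid": 2.0,
    "fee": 1.5,
}


def topic_score(text, words=TRANSFER_WORDS):
    """Add up the weights of all listed words found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Exact matches only: "offers" or "contracts" would need their own entries.
    return sum(words.get(token, 0.0) for token in tokens)


def is_transfer_story(text, threshold=4.0):
    """True if the story scores high enough to count as transfer news."""
    return topic_score(text) >= threshold
```

Raising the threshold makes the filter stricter, which is the direction I'd rather err in: a missed transfer story costs nothing, a misplaced one annoys people.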

Is that all?!?

So finally, after running every story through the subject filter, and if need be the topic filter, I would have a list of stories to publish in certain channels. A story would rarely qualify for more than 2 channels (a transfer from one club to another), it would most often just qualify for one. In addition, the editor filters out stories by date - any story older than 24h is marked outdated.
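
The date check is the simplest filter of the lot. As a sketch in Python (the function name is made up, and I'm assuming the reporter gives us a publication time for each story):

```python
from datetime import datetime, timedelta


def is_outdated(published, now, max_age=timedelta(hours=24)):
    """True if the story was published more than max_age before now."""
    return now - published > max_age
```

Both times need to be in the same time zone (or both timezone-aware), otherwise a story from a source in another country could be off by a few hours.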

The proper cherry on the cake would be to create a Channel Finder module - to find channels automatically. But after having thought about this for a few weeks, I still can't think of a way to do it that would assure any kind of half-decent success rate. Certainly not without trying to analyze English language to some extent, which would be incredibly complicated, if even the least bit effective.