Archive for the ‘newman’ Category

Project Newman :: The reporter

Sunday, August 20th, 2006

The reporter is the part of Newman which retrieves stories from various websites. The process is fairly straightforward:

  1. Retrieve web page containing a list of the latest news stories.
  2. Read the list of stories and retrieve links to individual stories.
  3. Retrieve each story one by one.

This description is generic enough to fit every site I considered reporting from, notably Football Italia, Tribalfootball, Eurosport, and Goal. Every site has a list of stories, with the individual stories on separate pages. But that doesn’t mean there weren’t a few challenges in making this work, namely:

  • Every site uses different HTML, so we have to read the info we need out of the HTML source using regular expressions.
  • The result from every story retrieval should be just plain text, no html tags or other code.
  • If the connection fails or times out, Newman should ignore the error and continue; it shouldn’t crash.
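
To make the steps and challenges above a bit more concrete, here is a minimal sketch of what a reporter could look like. It is not Newman’s actual code. The link pattern, the function names, and the choice of urllib are all assumptions made for illustration, and every real site would need its own regular expression.

```python
import html
import re
import urllib.request
from urllib.parse import urljoin

# Hypothetical pattern: assumes story links look like
# <a class="story" href="...">. Each real site needs its own.
LINK_RE = re.compile(r'<a\s+class="story"\s+href="([^"]+)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def fetch(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError and timeouts are both OSError subclasses.
        # Swallow the error so one bad page cannot stop the run.
        return None


def extract_links(index_html, base_url, link_re=LINK_RE):
    """Step 2: pull story links out of the list page."""
    return [urljoin(base_url, href) for href in link_re.findall(index_html)]


def to_plain_text(story_html):
    """Strip tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", story_html)
    text = html.unescape(text)
    return " ".join(text.split())


def report(index_url, fetch=fetch):
    """Steps 1 to 3: list page, then links, then each story in turn."""
    index_html = fetch(index_url)
    if index_html is None:
        return []
    stories = []
    for link in extract_links(index_html, index_url):
        story_html = fetch(link)
        if story_html is None:
            continue  # skip the failed story and carry on
        stories.append(to_plain_text(story_html))
    return stories
```

Passing `fetch` in as a parameter means the whole thing can be tried without touching the network, for instance by handing it the `get` method of a dict that maps URLs to canned HTML. Note that stripping tags with a regular expression is a rough approach: it leaves the contents of script and style blocks in the text, so a real reporter would need more care there.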


Project Newman :: An overview

Saturday, August 19th, 2006

Project Newman table of contents

Getting started

As mentioned in the introduction, Project Newman is about building a newsbot: a robot to post news. Now that the purpose and basic idea have been drawn up, it’s time to get into some specifics.

Newman would basically be doing three things, so it makes sense to design those three functions as separate parts:

  • the reporter will fetch news stories from various football news websites, which we call sources
  • the editor will edit the stories, deciding which ones to post and which to discard
  • the publisher will post stories on Xtratime.org (or theoretically other sites, which we call targets)

So that’s the basic architecture. (If you think this smells too much of Java-speak, don’t worry: I only used OO where it was feasible, and most of it is just Python modules.)
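
As a rough illustration of how the three parts could fit together, here is a minimal sketch. The `Story` class, the function names, and the editor’s selection rule (skip empty stories and titles already posted) are assumptions made for the example, not Newman’s real design. Sources and targets are modelled as plain callables so the flow can be followed without any network code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    source: str
    title: str
    text: str


def report(sources):
    """Reporter: collect stories from every source.

    Each source is a callable returning a list of Story objects;
    a real one would download and parse a website.
    """
    stories = []
    for source in sources:
        stories.extend(source())
    return stories


def edit(stories, already_posted):
    """Editor: decide which stories to post and which to discard.

    Assumed rule for this sketch: drop empty stories and any title
    that has been seen before.
    """
    seen = set(already_posted)
    selected = []
    for story in stories:
        if story.text.strip() and story.title not in seen:
            selected.append(story)
            seen.add(story.title)
    return selected


def publish(stories, targets):
    """Publisher: hand every selected story to every target."""
    for story in stories:
        for target in targets:
            target(story)


def run(sources, targets, already_posted=()):
    """One full pass: report, edit, publish."""
    stories = edit(report(sources), already_posted)
    publish(stories, targets)
    return stories
```

Calling `run([some_source], [print])` would print whatever the editor lets through. Because each stage only takes and returns plain data, any one of them can be swapped out or tested on its own.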

And there is one rule in Project Newman:

  • Newman must run without any user interaction!

Project Newman :: An introduction

Thursday, August 17th, 2006

I have been posting on Xtratime.org (a football forum) since sometime in 2000. The site has been through a lot in that time, but one thing that hasn’t changed is a member called Carson35 posting news stories from various football news sites with astonishing regularity. He now has 74k+ posts, far more than anyone else, and most of those are copy/paste jobs of news stories. Over the years he’s become a celebrity for his undaunted commitment to bringing the news, decorated with a special title: XT Post Number King. Some have jokingly suggested that he’s a robot, programmed to do this one thing.

So I thought it would be fun to try and imitate Carson, as a tribute if you will. And, of course, I mean computationally, in an automated manner. The purpose of such a thing would be to satisfy my curiosity in certain areas:

  • how hard would it be to imitate Carson35 by posting news articles?
  • how closely would I be able to reproduce his activity?
  • what are the biggest challenges in making this work without any user input?
  • just how automated could it be made?
  • could I build a bot that would be accepted by other members (or at least not hated for spamming)?

The project was first dubbed Carson36, as an increment of the Carson we all know. But then Erik suggested Newman – for a bot that brings the news – and I couldn’t resist that name. :D


While this is a technical topic, I’ll try to do something I’m not good at – explain it in simple terms. That’s what good technical writers do, and it would be nice to imitate.

This entry is part of the series Project Newman.