Archive for August 20th, 2006

Project Newman :: The reporter

Sunday, August 20th, 2006

The reporter is the part of Newman which retrieves stories from various websites. The process is fairly straightforward:

  1. Retrieve web page containing a list of the latest news stories.
  2. Read the list of stories and retrieve links to individual stories.
  3. Retrieve each story one by one.

This description is generic enough to be satisfied by every site of those I considered reporting from, notably Football Italia, Tribalfootball, Eurosport, and Goal. Every site has a list of stories and then individual stories on separate pages. But that doesn’t mean there weren’t a few challenges to make this work, notably:

  • Every site uses different html – we have to read the info we need out of the html source by using regular expressions.
  • The result from every story retrieval should be just plain text, no html tags or other code.
  • If the connection fails or times out, Newman should ignore the error and continue, it shouldn’t crash.

(more…)