<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>numerodix blog &#187; newman</title>
	<atom:link href="http://www.matusiak.eu/numerodix/blog/index.php/category/techno-babble/newman/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.matusiak.eu/numerodix/blog</link>
	<description>A blog about nothing</description>
	<lastBuildDate>Sun, 12 Feb 2012 18:25:03 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Project Newman :: An evaluation</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/29/project-newman-an-evaluation/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/29/project-newman-an-evaluation/#comments</comments>
		<pubDate>Tue, 29 Aug 2006 06:59:48 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[newman]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/index.php/2006/08/29/project-newman-an-evaluation/</guid>
		<description><![CDATA[The thing about a project like Newman is that it&#8217;s basically impossible to make it work perfectly. It has a difficult job, because there are so many potential sources of error. Servers may go offline, connections may fail, article formats may change and so on. It is as good as impossible to guarantee that Newman [...]]]></description>
			<content:encoded><![CDATA[<p>The thing about a project like Newman is that it&#8217;s basically impossible to make it work perfectly. It has a difficult job, because there are so many potential sources of error. Servers may go offline, connections may fail, article formats may change and so on. It is as good as impossible to guarantee that Newman will do the right thing, because at the end of the day we are trying to analyze text, and computers are not good at doing that. Just look at spam filters &#8211; they have been improved upon for years, but everyone is still getting spam. Much less than before, of course, so the filters <em>are</em> definitely useful. And Newman too makes mistakes, but it does still succeed <em>quite</em> often.</p>
<p>Newman has been posting on Xtratime.org under the username <a href="http://www.xtratime.org/forum/member.php?u=27372">Carsonne</a>, a French female impersonator of <a href="http://www.xtratime.org/forum/member.php?u=1369">Carson35</a>&#8217;s, it would seem. <img src='http://www.matusiak.eu/numerodix/blog/wp-includes/images/smilies/biggrin.png' alt=':D' class='wp-smiley' />  Carsonne has averaged about 15 posts a day since July 30, which comes to a little over 350 posts in all (<a href="http://www.xtratime.org/forum/search.php?do=finduser&amp;u=27372">350+ news stories posted</a>). While I haven&#8217;t been keeping score to present statistical numbers, I have kept a close eye on Carsonne and I would estimate that upwards of 90% of the stories posted were correctly parsed, formatted and classified. In fact, I recall about 10-15 misposts among the ones I&#8217;ve seen (which I think is most of them). That is still an error rate no human poster would have: at an estimated 95% success rate, Carsonne makes at least ten times as many mistakes as a human poster would (i.e. I would claim that a human poster would have a &gt;99.5% success rate at copy/pasting and classifying stories &#8211; fewer than 2 misposts in 350).</p>
<p><span id="more-376"></span>What about user input, then? Well, unfortunately Newman does come with a certain configuration cost: not everything can be automated. In particular, finding channels is something that would be wonderful to automate, given how quickly the forum climate changes. Newman also requires that <em>sources</em> be configured (and, if need be, updated) for the parsing to work. Of course, once that is in place, Newman can post at will. So that is still quite a limited set of abilities.</p>
<p>The screenshot below shows a typical run of Newman. Quite a few stories were fetched, some were selected for posting, and then posted. It also shows how Newman is fault tolerant &#8211; a parsing error was handled gracefully, as was a timeout from the forum web server.</p>
<p style="text-align: center"><img src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/newman_running.png" id="image378" alt="newman_running.png" /></p>
<p>After 20+ days on the forum, Carsonne has been active long enough to stir up some reactions about &#8220;her&#8221; <img src='http://www.matusiak.eu/numerodix/blog/wp-includes/images/smilies/wink.png' alt=';)' class='wp-smiley' />  posting of news. Carson&#8217;s long tenure has paved the way for posters like this, so Carsonne is seen by most as just another compulsive news poster. &#8220;She&#8221; has taken some heat over posting news in the wrong place (wrong classification), but beyond that it has been no worse than Carson gets daily.</p>
<p><strong>So what have we learnt?</strong></p>
<p>As is often the case, Project Newman seems to have raised more questions than it has answered. Sure enough, it isn&#8217;t too hard to automate posting on a forum, it isn&#8217;t too hard to fetch stories from the web and parse them, and it certainly isn&#8217;t hard to automate this beyond any human&#8217;s ability to keep up. But it <em>is</em> hard to decide what text means, it <em>is</em> hard to decide which story is relevant to what thread, it <em>is</em> hard to decide whether a word in a sentence is a name and so on.</p>
<p>The question is just how to do these things in a reliable way.</p>
<p>Thus endeth Project Newman. Download the code from the <a href="/numerodix/code.php">code page</a> if you&#8217;re interested.</p>
<p><em>This entry is part of the series <a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/">Project Newman</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/29/project-newman-an-evaluation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Project Newman :: Further ideas not implemented</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/27/project-newman-further-ideas-not-implemented/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/27/project-newman-further-ideas-not-implemented/#comments</comments>
		<pubDate>Sun, 27 Aug 2006 09:32:47 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[newman]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/index.php/2006/08/26/project-newman-further-ideas-not-implemented/</guid>
		<description><![CDATA[Newman was meant to be a simple design that wouldn&#8217;t take too long to build (it took me about 2 weeks of afternoons to write) and just focus on the issues that are simple to handle without making a complicated mess of it. There are all kinds of ways in which it could be improved [...]]]></description>
			<content:encoded><![CDATA[<p>Newman was meant to be a simple design that wouldn&#8217;t take too long to build (it took me about 2 weeks of afternoons to write) and just focus on the issues that are simple to handle without making a complicated mess of it. There are all kinds of ways in which it could be improved and I&#8217;ll mention some of the ideas that I decided to leave out.</p>
<p><span id="more-375"></span></p>
<p><strong>Channel limits</strong></p>
<p>Even in threads where it is deemed acceptable to post news articles, it isn&#8217;t civil to post 20 articles a day in one thread. To overcome this problem, one could limit the number of stories that may be posted in one channel per day. This would require another cache, to keep track of how many stories have been posted in which channel already today.</p>
<p>This is something I thought I would do, but I haven&#8217;t written it, because it doesn&#8217;t seem to be a problem. Newman can post up to 5 stories in one thread (which is a bit much), but it will only do this in a couple of channels (the Real Madrid one tends to see a lot of news). So the amount of news posted in a channel reflects the amount of news about that club that day (and that seems quite fair).</p>
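<p>For illustration only, here is a minimal Python sketch of what such a per-channel daily cache might look like. The class name <code>ChannelLimiter</code>, the <code>max_per_day</code> parameter and the channel names are all hypothetical; nothing like this exists in Newman.</p>

```python
from collections import Counter
from datetime import date


class ChannelLimiter:
    """Tracks how many stories were posted per channel on the current day."""

    def __init__(self, max_per_day):
        self.max_per_day = max_per_day  # assumed limit, e.g. 5 stories
        self.day = None
        self.counts = Counter()

    def try_post(self, channel, today=None):
        """Return True and record the post if the channel is under its limit."""
        today = today or date.today()
        if today != self.day:
            # A new day has started: forget yesterday's counts.
            self.day = today
            self.counts.clear()
        if self.counts[channel] >= self.max_per_day:
            return False
        self.counts[channel] += 1
        return True
```

<p>The publisher would call <code>try_post</code> before each post and skip the story when it returns <code>False</code>.</p>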
<p><strong>A bandwidth monitor</strong></p>
<p>Newman deals in receiving and sending data to web servers. This generates a fair bit of traffic. Web hosts tend to be somewhat sensitive about one client making lots of connections, because bandwidth isn&#8217;t free. While Newman does not operate on a mass level and will not overload any server by itself, it may still be useful to know just how much bandwidth it generates.</p>
<p>I haven&#8217;t looked deeply into this, so I&#8217;m not sure how to do it. Web servers send a <em>Content-Length</em> field in HTTP headers, but this will only account for the traffic received. Perhaps a byte count of the URL-encoded form input that Newman sends could be used to track outgoing traffic. But even this would not account for the low-level overhead in establishing socket connections.</p>
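<p>A rough sketch of that idea in Python, under the assumption that response headers are available as a dictionary and that form fields are sent URL-encoded. The <code>BandwidthMonitor</code> class is hypothetical, and as noted above it ignores request headers and connection overhead.</p>

```python
from urllib.parse import urlencode


class BandwidthMonitor:
    """Keeps a running, approximate byte count of traffic in both directions."""

    def __init__(self):
        self.received = 0
        self.sent = 0

    def record_response(self, headers, body):
        # Prefer the Content-Length header; fall back to the body size
        # when the server did not send one.
        length = headers.get("Content-Length")
        self.received += int(length) if length is not None else len(body)

    def record_form_post(self, form_fields):
        # Counts only the URL-encoded form body, not headers or TCP overhead.
        self.sent += len(urlencode(form_fields).encode("ascii"))
```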
<p><strong>User agent scrambling</strong></p>
<p>Newman identifies itself as Firefox, so it looks like just another human client connecting to the web server. But the <strong>reporter</strong> retrieves a list of stories and then sequentially retrieves every story. This behavior is too systematic to be human (no one is interested in <em>every</em> story), and so a web admin who keeps track of logs, and who sees that the user agent is always the same, will assume it&#8217;s the same client making these connections.</p>
<p>To escape that kind of detection, we could easily scramble the user agent string and just make Newman report a different user agent for every connection. This would make it look like the connections are coming from a shared ip address, but from different clients (for instance different people in a school or company).</p>
<p><strong>Client source scrambling</strong></p>
<p>Closely related to the point above, a web admin who wants to block Newman can only do so by blocking the ip address the connections are coming from. This could be overcome by making Newman connect through anonymous proxies (or maybe use the <a href="http://en.wikipedia.org/wiki/Tor_%28anonymity_network%29">Tor network</a>?). Connecting through a different proxy every time would make blocking ip addresses a lot more difficult.</p>
<p><strong>Running stories through online translator</strong></p>
<p>This is a very silly idea, but sometimes people post stories from a non-English source, because the story talks about something that the English language sources haven&#8217;t caught up with yet. Since Xtratime.org is an English speaking forum, most people don&#8217;t understand these articles. So the person who posts it will sometimes post an automatically generated translation of the text, using the notoriously bad <a href="http://babelfish.altavista.com/">Altavista Babelfish</a>.</p>
<p>And so it could be an extra feature for Newman to retrieve articles from <a href="http://www.gazzetta.it/">gazzetta.it</a> or <a href="http://www.marca.es/">Marca</a>, translate them with Babelfish, and post them. This would further reduce the quality of the articles Newman is posting, but it certainly is something that human posters tend to do.</p>
<p><em>This entry is part of the series <a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/">Project Newman</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/27/project-newman-further-ideas-not-implemented/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Project Newman :: The scheduler</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/26/project-newman-the-scheduler/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/26/project-newman-the-scheduler/#comments</comments>
		<pubDate>Sat, 26 Aug 2006 08:49:23 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[newman]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/index.php/2006/08/25/project-newman-the-scheduler/</guid>
		<description><![CDATA[Now that we&#8217;ve covered the reporter, the editor and the publisher, we have a functional Newman that can actually post stories. I set up Newman to run in a cron job (ie. at set intervals) to run every three hours, but then it occurred to me that it isn&#8217;t human behavior to post at 9am, [...]]]></description>
			<content:encoded><![CDATA[<p>Now that we&#8217;ve covered the <strong>reporter</strong>, the <strong>editor</strong> and the <strong>publisher</strong>, we have a functional Newman that can actually post stories. I set up Newman as a cron job (i.e. to run at set intervals) that runs every three hours, but then it occurred to me that it isn&#8217;t human behavior to post at 9am, then at 12pm, then at 3pm and so on; it just doesn&#8217;t look real. And if someone were to keep an eye on Newman, they might notice that it always posts at regular intervals, which looks odd.  (The point here isn&#8217;t so much to fool people into believing that Newman is real, it is just to make it so that it seems to exhibit a lot of human qualities.)</p>
<p>So I thought why not add a scheduler to decide when Newman should run. The scheduler runs as a daemon (ie. an application that runs 24/7 in the background, but only does actual work whenever it is called upon). So the scheduler is given a time interval (for instance: 3 hours), and then it generates a random number between 0 and 3 hours. That&#8217;s when Newman is going to run. And then it goes to sleep until that time. So if I start the scheduler at 10am, give it an interval of three hours, it may decide that Newman should run at 11.45. So then it goes to sleep until 11.45 and then it runs Newman.</p>
<div style="text-align: center"><img alt="newman_scheduler.png" id="image374" src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/newman_scheduler.png" /></div>
<p>The advantage of this method is also that if the scheduler runs Newman and Newman crashes, it won&#8217;t make the scheduler crash. So the scheduler will still keep running and will again run Newman at the next interval. I&#8217;ve also made sure that the scheduler waits for Newman to finish, so that if Newman is taking a lot of time to complete and the next interval is in 5 minutes, Newman will not be started again until the current execution is finished.</p>
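<p>The behaviour described above can be sketched in a few lines of Python. This is not Newman&#8217;s scheduler, just a minimal illustration under stated assumptions: the job is passed in as a plain callable, crash isolation is done with <code>try</code>/<code>except</code> (running Newman as a separate process would be another way), and the names <code>run_scheduler</code>, <code>job</code> and <code>rounds</code> are made up for the example.</p>

```python
import random
import time


def run_scheduler(job, interval_seconds, rounds, sleep=time.sleep, rng=random):
    """Run `job` once at a random moment inside each interval."""
    outcomes = []
    for _ in range(rounds):
        # Pick a random offset between 0 and the length of the interval.
        delay = rng.uniform(0, interval_seconds)
        sleep(delay)
        try:
            job()  # blocks until the job is done, so two runs never overlap
            outcomes.append("ok")
        except Exception as exc:
            # A crashing job must not take the scheduler down with it.
            outcomes.append("failed: %s" % exc)
        # Sleep out the rest of the interval before picking the next offset.
        sleep(interval_seconds - delay)
    return outcomes
```

<p>Note that in this sketch the time the job itself takes pushes the following interval back, so the schedule drifts a little with every run.</p>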
<p><em>This entry is part of the series <a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/">Project Newman</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/26/project-newman-the-scheduler/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Project Newman :: The publisher</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/24/project-newman-the-publisher/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/24/project-newman-the-publisher/#comments</comments>
		<pubDate>Fri, 25 Aug 2006 04:52:34 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[newman]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/index.php/2006/08/23/project-newman-the-publisher/</guid>
		<description><![CDATA[Compared to what we&#8217;ve talked about so far, the publisher is a pretty simple piece of the puzzle. It receives a list of stories, each one assigned to one or more channels, and simply posts them on the selected target, that is Xtratime.org. For this to work, we must first prepare an account on the [...]]]></description>
			<content:encoded><![CDATA[<p>Compared to what we&#8217;ve talked about so far, the <strong>publisher</strong> is a pretty simple piece of the puzzle. It receives a list of stories, each one assigned to one or more <em>channels</em>, and simply posts them on the selected <em>target</em>, that is Xtratime.org. For this to work, we must first prepare an account on the forum for Newman. Having done that, the publisher will log in the user, open the thread where the story should be posted and simply post it, adding some vBcode formatting to the text. The image below shows what a typical news post looks like.</p>
<p style="text-align: center"><img src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/newman_publisher.png" alt="newman_publisher.png" id="image370" /></p>
<p><span id="more-369"></span>While Newman posts articles on Xtratime.org, which is a vBulletin forum, it could just as easily post them on any other website. It&#8217;s just a matter of reading the HTML code sent to us from the server and submitting HTML forms with the correct data. I won&#8217;t bore you with the details. Of course, there is always a chance that the server may drop the connection while the posting is in progress, or the connection may fail in the first place. In these cases, the publisher will report the error, but it will try to post the next story as if nothing happened.</p>
<p>In the event that vBulletin rejects the post for any reason, Newman tries to read this error and report it. It may be that two stories have been posted too close together (vBulletin has a limit for how often posts can occur by the same user), or perhaps something else didn&#8217;t go to plan. In any event, it handles these errors gracefully without crashing.</p>
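<p>As a hedged illustration of that error handling, the loop below posts each story in turn and records failures instead of stopping. The <code>PostRejected</code> exception and the <code>post_story</code> callable are hypothetical stand-ins for whatever actually talks to the forum.</p>

```python
class PostRejected(Exception):
    """Hypothetical error for a post the forum refused (e.g. posted too soon)."""


def publish_all(stories, post_story):
    """Post every (channel, story) pair; report failures instead of crashing."""
    posted, errors = [], []
    for channel, story in stories:
        try:
            post_story(channel, story)
            posted.append((channel, story))
        except (PostRejected, OSError) as exc:
            # In Python 3, OSError covers dropped connections and timeouts.
            errors.append((channel, story, str(exc)))
    return posted, errors
```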
<p>In early stages of development, I was testing Newman on a test forum I set up, which was protected with a password. Newman can use basic http authentication to access web sites in that way.</p>
<p><em>This entry is part of the series <a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/">Project Newman</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/24/project-newman-the-publisher/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Project Newman :: The editor</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/22/project-newman-the-editor/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/22/project-newman-the-editor/#comments</comments>
		<pubDate>Tue, 22 Aug 2006 06:36:21 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[newman]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/index.php/2006/08/22/project-newman-the-editor/</guid>
		<description><![CDATA[The editor is basically the &#8220;brain&#8221; of Newman. It&#8217;s the most complicated part, because it has to handle the most logic. Broadly speaking, it is the editor&#8217;s job to figure out which news articles to post where. Once the target is set (ie. Xtratime.org), the editor has to figure out whether each of the articles [...]]]></description>
			<content:encoded><![CDATA[<p>The <strong>editor</strong> is basically the &#8220;brain&#8221; of Newman. It&#8217;s the most complicated part, because it has to handle the most logic. Broadly speaking, it is the editor&#8217;s job to figure out which news articles to post where. Once the <em>target</em> is set (ie. Xtratime.org), the editor has to figure out whether each of the articles delivered by the <strong>reporter</strong> should be published in any of the <em>channels</em> (ie. threads) we have available. The illustration below shows the editor&#8217;s role in the chain.</p>
<div style="text-align: center"><img alt="newman_schematic.png" id="image377" src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/newman_schematic.png" /></div>
<p><span id="more-368"></span><strong>Finding channels</strong></p>
<p>But before we dive right into it, a small note on channels is in order. Let&#8217;s start with a rather more basic question: <em>How does a human post news articles?</em> Carson35 will post articles wherever the particular story is relevant &#8211; either to the thread at large, or to the last few posts specifically. The question is whether Newman can imitate this behavior. Xtratime.org is divided into lots of forums, one for each club, where threads about that one club are found. Some of these forums have special threads active all through the season, like for instance a &#8220;transfer rumours&#8221; thread. So what should Newman do to decide where to post an article? It could iterate over the threads in a certain forum to figure out &#8220;what this thread is about&#8221;. But that is rather difficult to do, for a bot. Given just one sentence, how do you mechanically establish such a terribly human observation &#8211; <em>what it&#8217;s &#8220;about&#8221;</em>? If Newman could do this, it would be quite clever. But, I must admit that I can&#8217;t think of a method.</p>
<p>So, the approach I took was to input a list of channels manually selected. I can&#8217;t think of a way to establish what a post is about, or a thread is about, or if a certain story should be posted in a specific thread. So I had to fall back on a human method and simply give Newman a list of threads it can use to post articles in.</p>
<p><strong>The subject filter</strong></p>
<p>So I&#8217;ve built Newman to do all the tedious work for me, but I&#8217;ve already had to produce the channels myself, so it would be nice if Newman could do some work too now. Given the channels, we now have a bunch of stories and a bunch of channels &#8211; how do we match them? I&#8217;ve selected my channels so that I have one channel per club forum. One thread to post news articles in is already enough to aggravate sensitive forum people, so I&#8217;m not going to push my luck. This also means that I have to figure out if a certain story is <em>about</em> a certain club, or not. This I call the <em>subject filter</em> (i.e. to establish the subject of the story).</p>
<p>If you think this is already getting hazy, unfortunately it doesn&#8217;t get any better. I&#8217;m not at all interested in trying to deduce the meaning of sentences in English (this would likely take me around forever to finish). Instead, I&#8217;m limiting myself to just looking at individual words. So while a complete analysis would reveal that the phrase &#8220;the royal club&#8221; may be talking about Real Madrid, I won&#8217;t be getting into that. I will limit myself to looking for just words. Now, it may seem prudent that in order to establish that a story talks about a certain club, it would be helpful to look for the names of players who play for that club. But players change clubs all the time, so the list of players for every club would have to be updated every so often (and remember: we&#8217;re trying to <em>minimize</em> the human input here). Worse still, half the stories in the papers about Real Madrid discuss possible signings of players who belong to <em>other</em> clubs. To consider adding all players <em>linked</em> with a club to the list of players <em>at</em> every club would be mad.</p>
<p>So, the only thing I will use is the one name that doesn&#8217;t ever change: the name of the club itself. In its many incarnations. So a story that mentions &#8220;Real Madrid&#8221; is one that we probably want to classify as eligible for the Real Madrid channel. But it could also just mention <em>Real</em> or <em>Madrid</em> on their own, so we have to consider that too. But then again, <em>Real</em> could also refer to Real Zaragoza, so then &#8220;Real&#8221; should be a weaker match than &#8220;Real Madrid&#8221;. As you can probably see by now, this is going in the direction of a spam filter: searching for words and scoring them according to certain rules. In addition, it struck me that the <em>position</em> of a word in a text tends to mean something (if it&#8217;s at the beginning of the story, or in the title, it should give a higher score). Finally, the length of a story matters as well. In a typical story of the kind we like to analyze, the name of a club may appear anywhere from 2 to 5 times. In a very short story, it may only appear once. In a very long story, it may appear more times, in the guise of nicknames and phrases like &#8220;the royal club&#8221;. So a long story may mention Real Madrid 3 times, but may actually be about Barcelona, so we will not give &#8220;Real Madrid&#8221; as high a score as it would get in a shorter story.</p>
<p>Getting a bit hazy, is it? I thought it might. The subject filter works fairly well in most cases. There have been occasional whoopsies, like a story about Arsenal de Sarandí posted in the Arsenal (the English one) forum. And there was a story about Luis Valencia matching the Valencia channel. I have tried to filter this by searching for Valencia as part of a name (i.e. as part of a sequence of capitalized words) &#8211; and penalizing that match on suspicion of being the name of a person &#8211; but this kind of thing is very imprecise.</p>
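<p>To make the scoring idea a little more concrete, here is a small Python sketch. It is not the filter Newman uses: the aliases, the weights, the title bonus and the length normalisation are all invented for the example, and the penalty for person names is left out entirely.</p>

```python
import re

# Hypothetical aliases and weights: the full name is a stronger signal
# than either of its parts.
ALIASES = {"Real Madrid": 3.0, "Real": 1.0, "Madrid": 1.5}
TITLE_BONUS = 2.0  # assumed multiplier for a match in the title


def subject_score(title, body, aliases=ALIASES):
    """Score how strongly a story seems to be about one club."""
    score = 0.0
    remaining_title, remaining_body = title, body
    # Longest alias first, removing what matched, so that "Real Madrid"
    # is not counted a second time as "Real" and "Madrid".
    for alias in sorted(aliases, key=len, reverse=True):
        pattern = re.compile(r"\b%s\b" % re.escape(alias))
        weight = aliases[alias]
        score += weight * TITLE_BONUS * len(pattern.findall(remaining_title))
        score += weight * len(pattern.findall(remaining_body))
        remaining_title = pattern.sub(" ", remaining_title)
        remaining_body = pattern.sub(" ", remaining_body)
    # Normalise by length: the same mentions count for less in a long story.
    words = max(len(body.split()), 1)
    return score / (1 + words / 100.0)
```

<p>A story would then be assigned to the Real Madrid channel only if its score clears some threshold found by trial and error.</p>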
<p><strong>The topic filter</strong></p>
<p>So far so good (do I sense hesitation?). For some channels it is enough to use the subject filter. But for others, those which have to do with transfer rumours, we should also decide whether a certain story is about possible transfers. (Incidentally, most soccer news <em>is</em>.) For this I created a separate filter, so that a story matching on &#8220;Real Madrid&#8221; would then have to pass through the <em>topic filter</em> to see if it seems to be transfer news. For this I used a word list &#8211; a list of words that are highly relevant to transfers, such as <em>contract</em> and <em>offer</em>. Then I scored these words just like I did with the subject filter and set the threshold after some trial and error to filter stories fairly reliably. In any case, I would rather filter out more stories wrongly than post irrelevant news (just like a spam filter would rather allow more spam than risk losing your non-spam email). This way, some transfer news didn&#8217;t make the cut, but after all there are enough stories published every day to suffice.</p>
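<p>A word-list filter of this kind could look roughly like the following Python sketch. Only <em>contract</em> and <em>offer</em> come from the text above; the other words, all the weights and the threshold are assumptions made up for the example.</p>

```python
import re

# Hypothetical word list and threshold for transfer news.
TRANSFER_WORDS = {"contract": 2.0, "offer": 2.0, "bid": 1.5, "signing": 1.5}
THRESHOLD = 3.0


def topic_score(text, words=TRANSFER_WORDS):
    """Add up the weights of all listed words that occur in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(words.get(token, 0.0) for token in tokens)


def is_transfer_news(text, threshold=THRESHOLD):
    # A fairly high threshold drops borderline stories rather than
    # risk posting something off topic.
    return topic_score(text) >= threshold
```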
<p><strong>Is that all?!?</strong></p>
<p>So finally, after running every story through the <em>subject filter</em>, and if need be the <em>topic filter</em>, I would have a list of stories to publish in certain channels. A story would rarely qualify for more than 2 channels (a transfer from one club to another), it would most often just qualify for one. In addition, the editor filters out stories by date &#8211; any story older than 24h is marked outdated.</p>
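<p>The date check is the simplest part of the editor. A minimal Python sketch, assuming each story carries a parsed publication time (the function name <code>is_outdated</code> is hypothetical):</p>

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=24)


def is_outdated(published, now):
    """A story is outdated once it is more than 24 hours old."""
    return now - published > MAX_AGE
```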
<p>The proper cherry on the cake would be to create a Channel Finder module &#8211; to find channels automatically. But after having thought about this for a few weeks, I still can&#8217;t think of a way to do it that would assure any kind of half-decent success rate. Certainly not without trying to analyze the English language to some extent, which would be incredibly complicated, if even the least bit effective.</p>
<p><em>This entry is part of the series <a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/">Project Newman</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/22/project-newman-the-editor/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Project Newman :: The reporter</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/20/project-newman-the-reporter/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/20/project-newman-the-reporter/#comments</comments>
		<pubDate>Sun, 20 Aug 2006 12:51:41 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[newman]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/index.php/2006/08/20/project-newman-the-reporter/</guid>
		<description><![CDATA[The reporter is the part of Newman which retrieves stories from various websites. The process is fairly straightforward:

Retrieve web page containing a list of the latest news stories.
Read the list of stories and retrieve links to individual stories.
Retrieve each story one by one.

This description is generic enough to be satisfied by every site of those [...]]]></description>
			<content:encoded><![CDATA[<p>The <strong>reporter</strong> is the part of Newman which retrieves stories from various websites. The process is fairly straightforward:</p>
<ol>
<li>Retrieve web page containing a list of the latest news stories.</li>
<li>Read the list of stories and retrieve links to individual stories.</li>
<li>Retrieve each story one by one.</li>
</ol>
<p>This description is generic enough to fit every one of the sites I considered reporting from, notably <a href="http://www.channel4.com/sport/football_italia/">Football Italia</a>, <a href="http://www.tribalfootball.com">Tribalfootball</a>, <a href="http://eurosport.com/">Eurosport</a>, and <a href="http://goal.com/">Goal</a>. Every site has a list of stories and then individual stories on separate pages. But that doesn&#8217;t mean there weren&#8217;t a few challenges to make this work, notably:</p>
<ul>
<li>Every site uses different html &#8211; we have to read the info we need out of the html source by using regular expressions.</li>
<li>The result from every story retrieval should be just plain text, no html tags or other code.</li>
<li>If the connection fails or times out, Newman should ignore the error and continue, <em>it shouldn&#8217;t crash</em>.</li>
</ul>
<p><span id="more-365"></span>Out of every story we need the <em>title</em>, the <em>date</em>, and the <em>body</em> of the story. The rest we can blissfully ignore. Even so, the sites differ: Football Italia presents these three elements in the order we want, whereas Goal prints the <em>date</em> first, then the <em>title</em> and <em>body</em>. It also divides the body into a <em>summary</em> and <em>the rest</em>. So these trivial variations had to be handled specifically for each site. Doing this requires analysis of the html code, which is not something Newman can do automatically. The image below shows a sample of html source and below it the regular expression needed to parse it.</p>
<p style="text-align: center"><img src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/parsing.png" id="image366" alt="parsing.png" /></p>
<p>One other point is that this parsing (text analysis) depends on the html being a certain way, every time. So if one story has two <code>&lt;br&gt;</code> tags between the date and the body, but another story has three, the parsing is likely to fail (the parsing is in fact a bit smarter than that, but it will only work with small variations). Even worse, should one of these sites do a redesign and change their whole html code, the whole analysis would have to be redone (this took me anything from 5 to 30 minutes for every site).</p>
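<p>To give a feel for what such a site-specific pattern involves, here is a Python sketch against an invented page layout. The markup, the class names and the pattern are all hypothetical and are not taken from any of the sites above; the pattern tolerates any number of <code>&lt;br&gt;</code> tags between the date and the body, and reports a failure when the layout no longer matches.</p>

```python
import re

# Hypothetical markup for one site; every real site needs its own pattern.
STORY_PATTERN = re.compile(
    r'<h1>(?P<title>.*?)</h1>\s*'
    r'<span class="date">(?P<date>.*?)</span>\s*'
    r'(?:<br\s*/?>\s*)*'  # tolerate zero or more <br> tags
    r'<div class="body">(?P<body>.*?)</div>',
    re.DOTALL | re.IGNORECASE,
)


def parse_story(html):
    """Return a dict with title, date and body, or None if nothing matches."""
    match = STORY_PATTERN.search(html)
    if match is None:
        return None  # e.g. the site was redesigned
    return {name: value.strip() for name, value in match.groupdict().items()}
```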
<p>Once the three elements of the story have been read, it all has to be cleaned up and formatted. We don&#8217;t want any html tags anywhere, and we don&#8217;t want any funny characters that will <a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/17/charset-wars/">come out garbled</a>. Anything retrieved from the web is by definition garbage, so we need to make sure that we clean it up whether or not it is clean. Once we&#8217;ve done that, we need to do some formatting. Again we assume nothing about how the story is formatted when it comes in. For all we know there may be 14 spaces between each word (html collapses runs of whitespace into one space, so nobody would notice), 5 line breaks between paragraphs and so on. There are some things we can fix easily &#8211; for instance there should never be a space between a character and a comma that follows it &#8211; and some things we cannot do much about &#8211; it is difficult to determine whether there is a line break within a sentence, because it&#8217;s hard to tell what is a sentence and what isn&#8217;t (do sentences always begin with a capital letter? what if there is a typo in the story? or what if a name is capitalized, how do you know if that&#8217;s the start of the sentence or just a part of it? what if the previous sentence is missing a full stop? etc).</p>
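<p>The easy part of that cleanup might be sketched in Python as follows. This is an assumption-laden illustration rather than Newman&#8217;s code: the function name <code>clean_text</code> is made up, tags are stripped with a crude regular expression, and blank lines are assumed to separate paragraphs.</p>

```python
import html
import re


def clean_text(raw):
    """Strip tags, decode entities and normalise whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)       # drop anything that looks like a tag
    text = html.unescape(text)                # decode HTML entities
    paragraphs = re.split(r"\n\s*\n", text)   # blank lines separate paragraphs
    cleaned = []
    for paragraph in paragraphs:
        paragraph = " ".join(paragraph.split())       # collapse runs of whitespace
        paragraph = re.sub(r" +,", ",", paragraph)    # no space before a comma
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)
```

<p>The hard cases mentioned above, such as a line break in the middle of a sentence, are deliberately not handled here.</p>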
<p>Ultimately, Newman is quite good at reporting stories. It tolerates connection errors and it has a very high success rate in cleaning and formatting stories correctly. It does sometimes miss funky special characters, because web sites don&#8217;t always tell us what character set they use (or they say they use one but then encode in another, or the encoding differs from one story to the next etc).</p>
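<p>One common way to cope with unreliable charset information is to try the declared charset first and then fall back to a few likely candidates. The sketch below shows that idea in present-day Python. The function name <code>decode_story</code> and the fallback order are assumptions made for illustration, not a description of what Newman actually does.</p>

```python
def decode_story(raw, declared=None):
    """Decode raw bytes, trusting the declared charset only if it works.

    Hypothetical helper; the fallback order below is an assumption.
    """
    for charset in (declared, "utf-8", "cp1252"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # latin-1 maps every byte to a character, so this cannot fail,
    # although the result may still contain the wrong characters.
    return raw.decode("latin-1")
```

<p>This cannot detect every mismatch: bytes that happen to decode without error in the wrong charset will still come out garbled.</p>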
<p>One last important thing the reporter does for us is handle the <em>story cache</em>. When the list of stories is retrieved, Newman stores each story&#8217;s title and url in a cache, so that the next time it retrieves the list it will know which stories it has already fetched in the past (this also makes sure the same story won&#8217;t be posted multiple times). This reduces the amount of bandwidth that Newman uses (let&#8217;s be nice to web hosts) and it speeds up Newman as well.</p>
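<p>A cache like this can be sketched in a few lines of present-day Python. Everything here is an assumption for illustration: the class name <code>StoryCache</code>, the choice of the url as the lookup key, and the JSON file used for storage are not taken from Newman itself.</p>

```python
import json
import os


class StoryCache:
    """Remember which stories have been seen, keyed by url.

    Hypothetical sketch; Newman's real storage format is not known here.
    """

    def __init__(self, path):
        self.path = path
        self.seen = {}
        # Load earlier entries if a cache file already exists.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.seen = json.load(f)

    def filter_new(self, stories):
        """Return the (title, url) pairs not seen before, and record them."""
        fresh = []
        for title, url in stories:
            if url not in self.seen:
                self.seen[url] = title
                fresh.append((title, url))
        return fresh

    def save(self):
        """Write the cache to disk so the next run can read it."""
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.seen, f)
```

<p>Only the stories returned by <code>filter_new</code> need to be downloaded in full, which is where the bandwidth saving comes from.</p>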
<p><em>This entry is part of the series <a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/">Project Newman</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/20/project-newman-the-reporter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Project Newman :: An overview</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/#comments</comments>
		<pubDate>Sat, 19 Aug 2006 09:31:56 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[newman]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/?p=364</guid>
		<description><![CDATA[Project Newman table of contents 

An introduction
An overview
The main components

The reporter


The editor


The publisher

Additional features

The scheduler


Further ideas not implemented

An evaluation

Getting started As mentioned in the introduction, Project Newman is about building a newsbot &#8211; a robot to post news. Now that the purpose and basic idea have been drawn up, it&#8217;s time to get into some [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Project Newman table of contents </strong></p>
<ul>
<li><a href="http://www.matusiak.eu/numerodix/blog/?p=362">An introduction</a></li>
<li><a href="http://www.matusiak.eu/numerodix/blog/?p=364">An overview</a></li>
<li>The main components
<ul>
<li><a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/20/project-newman-the-reporter/">The reporter</a></li>
<li><a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/22/project-newman-the-editor/">The editor</a></li>
<li><a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/24/project-newman-the-publisher/">The publisher</a></li>
</ul>
</li>
<li>Additional features
<ul>
<li><a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/26/project-newman-the-scheduler/">The scheduler</a></li>
<li><a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/27/project-newman-further-ideas-not-implemented/">Further ideas not implemented</a></li>
</ul>
</li>
<li><a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/29/project-newman-an-evaluation/">An evaluation</a></li>
</ul>
<p><strong>Getting started </strong>As mentioned in the introduction, Project Newman is about building a newsbot &#8211; a robot to post news. Now that the purpose and basic idea have been drawn up, it&#8217;s time to get into some specifics.</p>
<p>Newman basically does three things, so it makes sense to design those three functions as separate parts:</p>
<ul>
<li>the <strong>reporter</strong> will fetch news stories from various football news websites, which we call <em>sources</em></li>
<li>the <strong>editor</strong> will edit the stories, deciding which ones to post and which to discard</li>
<li>the <strong>publisher</strong> will post stories on Xtratime.org (or theoretically other sites, which we call <em>targets</em>)</li>
</ul>
<p>So that&#8217;s the basic architecture. (If you think this smells too much of Java-speak, don&#8217;t worry: I only used OO where it was feasible, and most of it is just Python modules.)</p>
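<p>As a rough sketch of how the three parts fit together, one pass could look like the following present-day Python. The function <code>run_newman</code> and its parameters <code>fetch</code>, <code>select</code> and <code>post</code> are hypothetical stand-ins for the reporter, editor and publisher, not the real module names.</p>

```python
def run_newman(sources, targets, fetch, select, post):
    """Run one pass: the reporter fetches, the editor selects, the publisher posts.

    All names are hypothetical; the three callables stand in for the
    reporter, editor and publisher parts.
    """
    posted = []
    for source in sources:
        stories = fetch(source)        # reporter: get stories from a source
        chosen = select(stories)       # editor: keep some, discard the rest
        for target in targets:
            for story in chosen:
                post(target, story)    # publisher: post to a target
                posted.append((target, story))
    return posted
```

<p>Nothing in the loop waits for input, so a pass like this can be started by a scheduler and left alone.</p>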
<p>And there is one rule in Project Newman:</p>
<ul>
<li>Newman must run without any user interaction!</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Project Newman :: An introduction</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/17/newman-an-introduction/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/17/newman-an-introduction/#comments</comments>
		<pubDate>Thu, 17 Aug 2006 21:19:47 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[newman]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/?p=362</guid>
		<description><![CDATA[I have been posting on Xtratime.org (a football forum) since sometime in 2000. The site has been through a lot in that time, but one thing that hasn&#8217;t changed is a member called Carson35 posting news stories from various football news sites with astonishing regularity. He now has 74k+ posts, far more than anyone else, [...]]]></description>
			<content:encoded><![CDATA[<p>I have been posting on <a href="http://www.xtratime.org/forum">Xtratime.org</a> (a football forum) since sometime in 2000. The site has been through a lot in that time, but one thing that hasn&#8217;t changed is a member called <a href="http://www.xtratime.org/forum/member.php?u=1369">Carson35</a> posting news stories from various football news sites with astonishing regularity. He now has 74k+ posts, far more than anyone else, and most of those are copy/paste jobs of news stories. Over the years he&#8217;s become a celebrity for his unflagging commitment to bringing the news, decorated with a special title &#8211; <em>XT Post Number King</em>. Some have jokingly suggested that he&#8217;s a robot, programmed to do this one thing.</p>
<p>So I thought it would be fun to try and imitate Carson, as a tribute if you will. And, of course, I mean computationally, in an automated manner. The purpose of such a thing would be to satisfy my curiosity in certain areas:</p>
<ul>
<li>how hard would it be to imitate Carson35 by posting news articles?</li>
<li>how closely would I be able to reproduce his activity?</li>
<li>what are the biggest challenges in making this work without any user input?</li>
<li>just how automated could it be made?</li>
<li>could I build a bot that would be accepted by other members (or at least not hated for spamming)?</li>
</ul>
<p>The project was first dubbed Carson36, as an increment of the Carson we all know. But then Erik suggested Newman &#8211; for a bot that brings the news &#8211; and I couldn&#8217;t resist that name. <img src='http://www.matusiak.eu/numerodix/blog/wp-includes/images/smilies/biggrin.png' alt=':D' class='wp-smiley' /> </p>
<p style="text-align: center"><img src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/newman.jpg" id="image363" alt="newman.jpg" /></p>
<p>While this is a technical topic, I&#8217;ll try to do something I&#8217;m not good at &#8211; explain it in simple terms. That&#8217;s what good technical writers do, and it would be nice to imitate.</p>
<p><em>This entry is part of the series <a href="http://www.matusiak.eu/numerodix/blog/index.php/2006/08/19/project-newman-an-overview/">Project Newman</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2006/08/17/newman-an-introduction/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

