<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>numerodix blog &#187; spiderfetch</title>
	<atom:link href="http://www.matusiak.eu/numerodix/blog/index.php/category/techno-babble/spiderfetch/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.matusiak.eu/numerodix/blog</link>
	<description>A blog about nothing</description>
	<lastBuildDate>Sun, 12 Feb 2012 18:25:03 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>spiderfetch, now in python</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2008/06/28/spiderfetch-now-in-python/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2008/06/28/spiderfetch-now-in-python/#comments</comments>
		<pubDate>Sat, 28 Jun 2008 01:18:24 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[spiderfetch]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/?p=1055</guid>
		<description><![CDATA[Coding at its most fun is exploratory. It&#8217;s exciting to try your hand at something new and see how it develops, choosing a route as you go along. Some people like to call this &#8220;expanding your ignorance&#8221;, to convey that you cannot decide on things you don&#8217;t know about, so first you have to become [...]]]></description>
			<content:encoded><![CDATA[<p>Coding at its most fun is exploratory. It&#8217;s exciting to try your hand at something new and see how it develops, choosing a route as you go along. Some people like to call this &#8220;expanding your ignorance&#8221;, to convey that you cannot decide on things you don&#8217;t know about, so first you have to become aware &#8211; and ignorant &#8211; of them. Then you can tackle them. If you want a buzzword for this I suppose you could call this &#8220;impulse driven development&#8221;.</p>
<p><img class="alignright size-full wp-image-1056" title="google_uncool" src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/google_uncool.png" alt="" width="242" height="139" />spiderfetch was driven completely by impulse. The original idea was to get rid of awkward, one-time grep/sed/awk parsing to extract urls from web pages. Then came the impulse &#8220;hey, it took so much work to get this working well, why not make it recursive at little added effort&#8221;. And from there on countless more impulses happened, to the point that it would be a challenge to recreate the thought process from there to here.</p>
<p>Eventually it landed on a 400 line <a href="http://www.matusiak.eu/numerodix/blog/index.php/2008/04/28/spiderfetch-part-2/">ruby script</a> that worked quite nicely, supported recipes to drive the spider and various other gimmicks. Because the process was completely driven by impulse, the code became increasingly dense and monolithic as more impulses were realized. And it got to the point where the code worked, but was pretty much a dead end from a development point of view. Generally speaking, the deeper you go into a project, the smaller an idea has to be if it is to be realized without major changes.</p>
<h3><strong>Introducing the web</strong></h3>
<p>The most disruptive new impulse was that since we&#8217;re spidering anyway, it might be fun to collect these urls in a graph and be able to do little queries on them. At the very least things like &#8220;what page did I find this url on&#8221; and &#8220;how did I get here from the root url&#8221; could be useful.</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-1059" title="spiderfetch_web_ss" src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/spiderfetch_web_ss.png" alt="" width="441" height="61" /></p>
<p>spiderfetch introduces the <em>web</em>, a local representation of the urls the spider has seen, either visited (spidered) or matched by any of the rules. Webs are stored, quite simply, in .web files. Technically speaking, the web is a graph of url nodes, with a hash table frontend for quick lookup and duplicate detection. Every node carries information about <em>incoming</em> urls (locations where this url was found) and <em>outgoing</em> urls (links to other documents), so the path from the root to any given url can be traced.</p>
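<p>To make the idea concrete, here is a minimal sketch of such a web in python. It is not the actual <em>web</em> module: the class and method names (<code>Node</code>, <code>Web</code>, <code>add_link</code>, <code>path_from_root</code>) are made up for illustration, and the .web file format is left out entirely.</p>

```python
from collections import deque


class Node:
    """One url in the web, with links in both directions."""

    def __init__(self, url):
        self.url = url
        self.incoming = set()  # urls of pages where this url was found
        self.outgoing = set()  # urls this page links to


class Web:
    """Graph of url nodes behind a dict for lookup and duplicate detection."""

    def __init__(self, root):
        self.root = root
        self.index = {root: Node(root)}

    def add_link(self, source, target):
        """Record that target was found on the page at source."""
        for url in (source, target):
            if url not in self.index:  # the dict doubles as the duplicate check
                self.index[url] = Node(url)
        self.index[source].outgoing.add(target)
        self.index[target].incoming.add(source)

    def path_from_root(self, url):
        """Return one shortest root-to-url path, or None if unreachable."""
        if url not in self.index:
            return None
        previous = {self.root: None}
        queue = deque([self.root])
        while queue:
            current = queue.popleft()
            if current == url:
                path = []
                while current is not None:
                    path.append(current)
                    current = previous[current]
                return path[::-1]
            for nxt in sorted(self.index[current].outgoing):
                if nxt not in previous:
                    previous[nxt] = current
                    queue.append(nxt)
        return None


# Hypothetical urls, only to show the two queries mentioned above.
web = Web("http://example.com/")
web.add_link("http://example.com/", "http://example.com/gallery")
web.add_link("http://example.com/gallery", "http://example.com/pic.jpg")
print(web.index["http://example.com/pic.jpg"].incoming)  # where was it found
print(web.path_from_root("http://example.com/pic.jpg"))  # how did I get here
```

<p>The dict answers &#8220;have I seen this url&#8221; in constant time, while the incoming and outgoing sets on each node carry the graph itself.</p>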
<h3><strong>Detecting file types</strong></h3>
<p>Aside from the web impulse, the single biggest flaw in spiderfetch was the lack of logic to deal with filetypes. Filetypes on the web work pretty much as well as they do on your local computer, which means if you rename a .jpg to a .gif, suddenly it&#8217;s not a .jpg anymore. File extensions are a very weak form of metadata and largely useless. It&#8217;s just the same with spidering: if you find a url on a page, you have no idea what it is. If it ends in .html then it&#8217;s probably that, but it may also have no extension at all. Or the extension can be misleading, which, when taken to perverse lengths (e.g. scripts like <a href="http://gallery.menalto.com/">gallery</a>), does away with .jpgs altogether and encodes everything as .php.</p>
<p>In other words, file extensions tell you nothing that you can actually trust. And that&#8217;s a crucial distinction: <em>what information do I have</em> vs <em>what can I trust</em>. In Linux we deal with this using <em>magic</em>. The <em>file</em> command opens the file, reads a portion of it, and scans for well known content that would identify the file as a known type.</p>
<p style="text-align: center;"><img class="size-full wp-image-1061" title="spiderfetch_spider_type_ss" src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/spiderfetch_spider_type_ss.png" alt="" width="452" height="172" /></p>
<p>For a spider this is a big roadblock, because if you don&#8217;t know what urls are actual html files that you want to spider, you have to pretty much download everything. Including potentially large files like videos that are a complete waste of time (and bandwidth). So spiderfetch brings the &#8220;magic&#8221; principle to spidering. We start a download and wait until we have enough of the file to check the type. If it&#8217;s the wrong type, we abort. Right now we only detect html, but there is a potential for extending this with all the information the <em>file</em> command has (this would involve writing a parser for &#8220;magic&#8221; files, though).</p>
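<p>The principle can be sketched in a few lines of python. This is only an illustration, not the code in spiderfetch: the marker list, the 512 byte threshold and the function names are assumptions, and the download is stood in for by any iterable of byte chunks.</p>

```python
# Assumed markers; a real check would be more careful than this.
HTML_MARKERS = (b"<!doctype html", b"<html", b"<head", b"<body")
SNIFF_BYTES = 512  # assumed: how much of the file to read before deciding


def looks_like_html(head):
    """Guess from the first bytes of a file whether it is html."""
    lowered = head.lower()
    return any(marker in lowered for marker in HTML_MARKERS)


def fetch_if_html(chunks):
    """Consume an iterable of byte chunks; give up early on non-html data.

    Returns the full body for html, or None if the download was aborted.
    """
    chunks = iter(chunks)
    head = b""
    for chunk in chunks:
        head += chunk
        if len(head) >= SNIFF_BYTES:
            break
    if not looks_like_html(head[:SNIFF_BYTES]):
        return None  # abort: the remaining chunks are never requested
    return head + b"".join(chunks)
```

<p>Because the chunks are pulled lazily, a large video costs only the first few hundred bytes before the fetch is abandoned.</p>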
<h3><strong>A brand new fetcher</strong></h3>
<p><img class="alignright size-medium wp-image-1057" title="spiderfetch_fetch_ss" src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/spiderfetch_fetch_ss.png" alt="" width="290" height="186" />To make filetype detection work, we have to be able to do more than just start a download and wait until it&#8217;s done. spiderfetch has a completely new fetcher in pure python (no more calling <em>wget</em>). The fetcher is actually the whole reason why the switch to python happened in the first place. I was looking through the ruby documentation in terms of what I needed from the library and I soon realized it wasn&#8217;t cutting it. The http stuff was just too puny. I looked up the same topic in the python docs and immediately realized that it would support what I want to do. In retrospect, the python urllib/httplib library has covered me very well.</p>
<p>The fetcher has to do a lot of error handling on all the various conditions that can occur, which means it also has a much deeper awareness of the possible errors. It&#8217;s very useful to know whether a fetch failed on 404 or a dns error. The python library also makes it easy to customize what happens on the various http status codes.</p>
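<p>As a rough illustration of telling these failures apart, here is a sketch using the Python 3 <code>urllib</code> (the fetcher itself was written against the older urllib/httplib, so this is not its code). The labels and the names <code>classify_failure</code> and <code>try_fetch</code> are invented for the example.</p>

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception raised by urlopen() to a short label."""
    # HTTPError subclasses URLError, so it has to be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return "http %d" % exc.code
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, socket.gaierror):
            return "dns"  # the host name could not be resolved
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return "connection"
    if isinstance(exc, socket.timeout):
        return "timeout"
    return "unknown"


def try_fetch(url, timeout=10):
    """Return (body, None) on success or (None, label) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except (urllib.error.URLError, socket.timeout) as exc:
        return None, classify_failure(exc)
```

<p>A spider can then retry on a timeout, but drop a url for good on a 404 or a dns failure.</p>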
<h3>A modular approach</h3>
<p>The present python code is a far cry from the abandoned ruby codebase. For starters, it&#8217;s three times larger. Python may be a little more verbose than ruby, but the increase is due to a new modularity and most of all, new features. While the ruby code had eventually evolved into one big chunk of code, the python codebase is a number of modules, each of which can be extended quite easily. The <em>spider</em> and <em>fetcher</em> can both be used on their own, there is the new <em>web</em> module to deal with webs, and there is <em>spiderfetch</em> itself. <em>dumpstream</em> has also been rewritten from shellscript to python and has become more reliable.</p>
<p>Grab it from github:</p>
<blockquote><p><a href="http://github.com/numerodix/spiderfetch/commits/0.4.0">spiderfetch-0.4.0</a></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2008/06/28/spiderfetch-now-in-python/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>spiderfetch, part 2</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2008/04/28/spiderfetch-part-2/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2008/04/28/spiderfetch-part-2/#comments</comments>
		<pubDate>Sun, 27 Apr 2008 22:42:24 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[spiderfetch]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/?p=973</guid>
		<description><![CDATA[Note: If you haven&#8217;t read part 1 you may be a little lost here.
So, the inevitable happened (as it always does, duh). I start out with a simple problem and not too many ambitions about what I want to accomplish. But once I reach that plateau, nay well before reaching it, I begin to ever [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;"><em>Note: If you haven&#8217;t read <a href="http://www.matusiak.eu/numerodix/blog/index.php/2008/04/26/download-all-media-links-on-a-webpage/">part 1</a> you may be a little lost here.</em></p>
<p style="text-align: left;">So, the inevitable happened (as it always does, duh). I start out with a simple problem and not too many ambitions about what I want to accomplish. But once I reach that plateau, nay well before reaching it, I begin to ever so quietly ask myself the question &#8220;wait a second, what if x?&#8221; and &#8220;this looks specialized, I wonder if I could generalize&#8230;&#8221;. And so before ever even reaching that hill top I&#8217;ve already, covertly, committed myself to taking it one step further. Not through a conscious decision, but through those lingering peripheral thoughts that you know won&#8217;t disappear once they&#8217;ve struck. A bell cannot be unrung and all that.</p>
<p style="text-align: left;">I realized this was happening, but I didn&#8217;t want to get into too much grubby stuff in the first blog entry, so I decided to keep that one simple and continue the story here. The first incarnation of <code>spiderfetch</code> had a couple of flaws that bugged me.</p>
<ol style="text-align: left;">
<li>No way to inspect how urls were being matched on the page, or even reason to believe this was happening correctly, other than giving an input and checking that all the expected urls were found. To make matching evident, I would need to be able to see the matches <em>visually</em> on the page.<br />
This has been addressed with a new option <code>--dumpcolor</code>, which dumps the index page and highlights the matches. This has made it much easier to verify that matching is done correctly.</li>
<li>Matching wasn&#8217;t sufficiently effective. The regex I had written would match urls inside tags, as long as they were in quotes. But this would still miss unquoted urls, and it also excluded all other urls on the page, which may or may not be of interest. I also realized that a single regex, no matter how refined, would be unlikely to match simultaneously all the urls that may be of interest.<br />
The obvious response is to add an option for multiple regexes, which is exactly what happened. This adds another layer of complexity to debugging regexes, so the match highlighting was extended to colorize every match in a different color. Furthermore, where two regexes would match the same characters, the highlighting is in bold to indicate this.</li>
</ol>
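<p>The highlighting scheme is easy to sketch. The following is a python illustration of the idea rather than the <code>--dumpcolor</code> code itself; the colour table and function names are assumptions.</p>

```python
import re

COLORS = ["31", "32", "33", "34", "35", "36"]  # ANSI foreground colour codes
BOLD = "1"


def coverage(text, patterns):
    """For every character, list the indexes of the patterns that match it."""
    hits = [[] for _ in text]
    for number, pattern in enumerate(patterns):
        for match in re.finditer(pattern, text):
            for pos in range(match.start(), match.end()):
                hits[pos].append(number)
    return hits


def highlight(text, patterns):
    """Colour each match by pattern; bold where two or more patterns overlap."""
    out = []
    for char, matched in zip(text, coverage(text, patterns)):
        if not matched:
            out.append(char)
            continue
        color = COLORS[matched[0] % len(COLORS)]
        codes = [color]
        if len(matched) > 1:
            codes.append(BOLD)
        out.append("\033[%sm%s\033[0m" % (";".join(codes), char))
    return "".join(out)


# Hypothetical input: one pattern for quoted strings, one for media file names.
print(highlight('href="a.ogg" src=b.png', [r'"[^"]+"', r"\w+\.(?:ogg|png)"]))
```

<p>Emitting an escape sequence per character is wasteful, but it keeps the overlap logic trivial, which is what matters when the point is debugging regexes.</p>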
<p style="text-align: left;">With that, I was far happier with the ability to infer and verify correctness in the matching behavior. Surely now everything is hunky-dory?</p>
<p style="text-align: left;">Or not? (As a classmate of mine likes to say after delivering a convincing argument, graciously giving you the chance to state your objections anyway.) Well, if you read part 1 of this adventure right to the end, noting the observation that <code>spiderfetch</code> could be run recursively, you may have thought what I thought. <em>Well gosh, Bubba, this is starting to sound like <code>wget --mirror</code>.</em> Since I&#8217;ve set up all this infrastructure already &#8212; to spider a single page &#8212; it wouldn&#8217;t really take much to generalize it to run recursively.</p>
<p style="text-align: left;">There are a couple of problems to solve, however. Firstly, the operational model for <code>spiderfetch</code> was very simple: spider a page, then fetch all the urls that match the pattern. In terms of multiplicity: <strong>1</strong> page to spider, <strong>1</strong> pattern to match urls against, <strong>n</strong> urls to find. If we now take this a step further, in the next pass we have <strong>n</strong> urls to spider (obtained from the <em>n</em> urls found in the first step), and we may need <strong>1</strong> pattern to filter some of them. Next, we spider these pages, which produces <em>(m<sub>1</sub>+m<sub>2</sub>+&#8230;)</em> (or roughly, <strong>n*m</strong>) urls and so on. This becomes rather convoluted to explain in words, so let&#8217;s visualize.</p>
<p><img class="alignnone size-full wp-image-974 alignright" style="float: right;" title="spider" src="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/spider.png" alt="" width="284" height="266" />Starting at the url to be spidered (the top green node), we spider the page for urls. For each of the urls found, it ends up in one of three categories:</p>
<ol>
<li>It matches the spider filter, so it becomes a url to spider in the next round (a black arrow).</li>
<li>It matches the fetch filter, so it becomes a url to fetch (a blue arrow).</li>
<li>It matches neither and is discarded (not shown).</li>
</ol>
<p>In the next round, we gather up all the urls that are to be spidered (the black arrows starting at the top green node) and do the same thing for each one as we did with just the one page to begin with.</p>
<p>But this complicates matters quite a lot. We now have to deal with a bunch of new issues:</p>
<ol>
<li>How do we traverse the nodes? <code>wget</code> in mirror/spider modes goes depth-first, which I always thought was eccentric. I don&#8217;t know why they do it this way, but I&#8217;m guessing to minimize memory use. If you go breadth-first then at every step you have to keep track of all the green nodes at the current level, which grows exponentially. Meanwhile, depth-first gives you linear growth, so that choice is well justified. But, on the other hand, the traversal order seems a bit unintuitive, because you &#8220;jump&#8221; from the deepest corner of your filesystem back to the top level and so on. I wonder if this will turn out to be foolish (I don&#8217;t expect <code>spiderfetch</code> to get the same kind of workout that <code>wget</code> does, obviously), but I&#8217;ve chosen the opposite approach, which I think also makes it easier to track what is happening underway.</li>
<li>How deep do we want to go? Do we want to set an upper bound or (gasp) let it run until it stops?</li>
<li>Until now we&#8217;ve only needed one filter (for the blue arrows at the top green node). Now we suddenly have a lot more arrows that we should be able to filter in some meaningful way. Obviously, we don&#8217;t want a pair of filters for every single node. Not only would that be madness, but we don&#8217;t know in advance how many nodes there will be.<br />
Our old friend <code>wget</code> only has one filter you can set for the whole site. But we want to be more specific than that, so there is a pair of filters <em>(spider, fetch)</em> for every level of the tree. This gives pretty decent granularity.</li>
</ol>
<p>So how can we represent this cleanly? Well, it would be rather messy to have to input this as a command line parameter, besides which a once written scheme for a particular scenario could be reusable. So instead we introduce the idea of a <em>recipe</em> composed of <em>rules</em>. Starting from the top of the tree, each rule applies to the next level of the tree. And once we have no more rules &#8212; or no more urls to spider &#8212; we stop.</p>
<p>Let&#8217;s take the <em>asx</em> example from part 1, where we had a custom made bash script to do the job. We can now rewrite it like this. First, the recipe is a list of rules, each rule is a hash. So starting from the top green node, we grab the first rule in the list, the one that contains the symbol <code>:spider</code>. This gives us the pattern to match urls on that page for spidering. There are no other patterns in there, so we spider these urls and then move on to the next step. We are now at the level below the top green node in the tree, with a bunch of pages from urls ending in <code>.asx</code>. We now grab the next rule in the recipe. This one gives a pattern for <code>:dump</code>, which means &#8220;dump these urls to the screen&#8221;. So we find all the urls that match this pattern in all of our green nodes and dump them. Since there are no more rules left, this is where we stop.</p>
<pre class="ruby"><span style="color:#9966CC; font-weight:bold;">module</span> Recipe
	RECIPE = <span style="color:#006600; font-weight:bold;">&#91;</span>
		<span style="color:#006600; font-weight:bold;">&#123;</span> :spider =&gt; <span style="color:#996600;">"<span style="color:#000099;">\\.</span>asx$"</span> <span style="color:#006600; font-weight:bold;">&#125;</span>,
		<span style="color:#006600; font-weight:bold;">&#123;</span> :dump =&gt; <span style="color:#996600;">"^mms:<span style="color:#000099;">\\/</span><span style="color:#000099;">\\/</span>"</span> <span style="color:#006600; font-weight:bold;">&#125;</span>,
	<span style="color:#006600; font-weight:bold;">&#93;</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre>
<p><i>Download this code: </i><a href="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/asx.rb">asx.rb</a></p>
<p>So you would use it like this:</p>
<blockquote><p>spiderfetch.rb --recipe asx http://www.something.com/somewhere</p></blockquote>
<p>The options for patterns are <code>:spider</code>, <code>:fetch</code>, and <code>:dump</code>. If you want to repeat the same rule several times (for example to spider an image gallery with 10 pages, which are linked together with <em>Next</em> and <em>Previous</em> links), you can also set <code>:depth</code> to a positive integer value. This will descend in the tree the given number of times, using the same rule again and again.</p>
<p>And if you&#8217;re feeling completely mental, you can even set <code>:depth =&gt; -1</code>, which will repeat the same rule until it runs out of urls to spider. You should probably combine this with <code>--host</code>, which will make sure you only spider the host (domain, to be exact) you started with, rather than the whole internet. (It will still allow <code>:fetch</code> and <code>:dump</code> to match urls on other hosts, so if you&#8217;re spidering for images and they live on <em>http://img.host.com</em> rather than <em>http://www.host.com</em>, they will still be found.)</p>
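<p>For the curious, the way a recipe drives the traversal can be sketched roughly as follows. This is a python illustration, not the ruby code in the download: <code>run_recipe</code>, the <code>get_links</code> callback and the <code>same_host</code> flag are names invented here, rules are plain dicts, and the host check compares the whole host name, which is stricter than the domain check described above.</p>

```python
import re
from urllib.parse import urlsplit


def run_recipe(root, recipe, get_links, same_host=False):
    """Walk a recipe breadth-first; return (fetched, dumped) url lists.

    get_links(url) must return the urls found on that page.
    """
    fetched, dumped = [], []
    level = [root]   # pages to spider at the current depth of the tree
    seen = {root}    # never spider the same url twice
    host = urlsplit(root).netloc

    for rule in recipe:
        depth = rule.get("depth", 1)  # -1 repeats until no urls are left
        while level and depth != 0:
            next_level = []
            for page in level:
                for url in get_links(page):
                    if "fetch" in rule and re.search(rule["fetch"], url):
                        fetched.append(url)
                    if "dump" in rule and re.search(rule["dump"], url):
                        dumped.append(url)
                    if "spider" not in rule or url in seen:
                        continue
                    if same_host and urlsplit(url).netloc != host:
                        continue  # only spidering is restricted to the host
                    if re.search(rule["spider"], url):
                        seen.add(url)
                        next_level.append(url)
            level = next_level
            depth -= 1
        if not level:
            break  # nothing left to spider
    return fetched, dumped


# A made-up site standing in for real downloads.
site = {
    "http://host/index": ["http://host/a.asx", "http://host/about.html"],
    "http://host/a.asx": ["mms://media.host/a.wmv"],
}
recipe = [{"spider": r"\.asx$"}, {"dump": r"^mms://"}]
print(run_recipe("http://host/index", recipe, lambda url: site.get(url, [])))
```

<p>The set of seen urls is what makes a depth of -1 terminate on a site whose pages link back to each other.</p>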
<p>Lastly, as a heavy handed arbitration measure, if you execute a recipe and pass either of <code>--dump</code> or <code>--fetch</code> this will switch <em>all</em> your <code>:fetch</code> patterns to <code>:dump</code> or vice versa. Might be nice to be able to check that the right urls are being found before you start fetching, for instance.</p>
<p>Download and go nuts:</p>
<blockquote><p><a href="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/spiderfetch-0.3.1.tar.gz">spiderfetch-0.3.1.tar.gz</a></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2008/04/28/spiderfetch-part-2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>download all media links on a webpage</title>
		<link>http://www.matusiak.eu/numerodix/blog/index.php/2008/04/26/download-all-media-links-on-a-webpage/</link>
		<comments>http://www.matusiak.eu/numerodix/blog/index.php/2008/04/26/download-all-media-links-on-a-webpage/#comments</comments>
		<pubDate>Sat, 26 Apr 2008 19:44:54 +0000</pubDate>
		<dc:creator>numerodix</dc:creator>
				<category><![CDATA[spiderfetch]]></category>

		<guid isPermaLink="false">http://www.matusiak.eu/numerodix/blog/?p=967</guid>
		<description><![CDATA[This has probably happened to you. You come to a web page that has links to a bunch of pictures, or videos, or documents that you want to download. Not one or two, but all. How do you go about it? Personally, I use wget for anything that will take a while to download. It&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>This has probably happened to you. You come to a web page that has links to a bunch of pictures, or videos, or documents that you want to download. Not one or two, but all. How do you go about it? Personally, I use <code>wget</code> for anything that will take a while to download. It&#8217;s wonderful, accepts http, https, ftp etc, has options to resume and retry, it never fails. I could just use Firefox, and if it&#8217;s small files then I do just that, and click all the links in one fell swoop, then let them all download on their own. But if it&#8217;s larger files then it&#8217;s not practical. You don&#8217;t want to download 20 videos of 200mb each in parallel, that&#8217;s no good. If Firefox crashes within the next few hours (which it probably will) then you&#8217;ll likely end up with not even one file successfully downloaded. And Firefox doesn&#8217;t have a resume function (there is a button but it doesn&#8217;t do anything <img src='http://www.matusiak.eu/numerodix/blog/wp-includes/images/smilies/rolleyes.png' alt=':rolleyes:' class='wp-smiley' />  ).</p>
<p>So there is a fallback option: copy all the links from Firefox and queue them up for wget: right click in document, <code>Copy Link Location</code>, right click in terminal window. This is painful and I last about 4-5 links before I get sick of it, download the web page and start parsing it instead. That always works, but I have to rig up a new chain of <code>grep</code>, <code>sed</code>, <code>tr</code> and <code>xargs wget</code> (or a <code>for</code> loop) for every page, I can never reuse that and so the effort doesn&#8217;t go a long way.</p>
<p>There is another option. I could use a Firefox extension for this, there are some of them for this purpose. But that too is fraught with pain. Some of them don&#8217;t work, some only work for some types of files,  some still require some amount of manual effort to pick the right urls and so on, some of them don&#8217;t support resuming a download after Firefox crashes. Not to mention that every new extension slows down Firefox and adds another upgrade cycle you have to worry about. Want to run Firefox 3? Oh sorry, your download extension isn&#8217;t compatible. <code>wget</code>, in contrast, never stops working. Most limiting of all, these extensions aren&#8217;t Unix-y. They assume they know what you want, and they take you from start to end. There&#8217;s no way you can plug in <code>grep</code> somewhere in the chain to filter out things you don&#8217;t want, for example.</p>
<p>So the problem is eventually reduced to: how can I still use <code>wget</code>? Well, browsers being as lenient as they are, it&#8217;s difficult to guarantee that you can parse every page, but you can at least try. <code>spiderfetch</code>, whose name describes its function: spider a page for links and then fetch them, attacks the common scenario. You find a page that links to a bunch of media files. So you feed the url to <code>spiderfetch</code>. It will download the page and find all the links (as best it can). It will then download the files one by one. Internally, it uses <code>wget</code>, so you still get the desired functionality and the familiar output.</p>
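<p>To give a flavour of the &#8220;find all the links&#8221; step, here is a small python sketch. The two patterns are assumptions chosen for illustration and are much cruder than what <code>spiderfetch</code> uses; <code>find_urls</code> is a made-up name.</p>

```python
import re

# Assumed patterns: quoted href/src attribute values, plus bare absolute urls.
QUOTED = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)
BARE = re.compile(r"""\b(?:https?|ftp|mms)://[^\s"'<>]+""", re.IGNORECASE)


def find_urls(page):
    """Return unique urls: quoted attribute values first, then bare urls."""
    found = []
    for pattern in (QUOTED, BARE):
        for match in pattern.finditer(page):
            # QUOTED captures the url in group 1; BARE matches it whole.
            url = match.group(1) if pattern.groups else match.group(0)
            if url not in found:
                found.append(url)
    return found
```

<p>Matching on text rather than parsing the html is deliberate: a lenient pattern still finds links on pages that a strict parser would choke on.</p>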
<p>If the urls on the page require additional post-processing, say they are <code>.asx</code> files you have to download one by one, grab the <code>mms://</code> url inside, and <code>mplayer -dumpstream</code>, you at least get the first half of the chain. (Unlikely scenario? If you wanted to download <a href="http://www.cs.washington.edu/education/courses/582/02au/lectures/index.html">these freely available lectures</a> on compilers from the University of Washington, you have little choice. You could even chain <code>spiderfetch</code> to do both: first spider the index page, download all the <code>.asx</code> files, then spider each <code>.asx</code> file for the <code>mms://</code> url, print it to the screen and let <code>mplayer</code> take it from there. No more <code>grep</code> or <code>sed</code>. <img src='http://www.matusiak.eu/numerodix/blog/wp-includes/images/smilies/smile.png' alt=':)' class='wp-smiley' />  )</p>
<p><strong>Features</strong></p>
<ul>
<li>Spiders the page for anything that looks like a url.</li>
<li>Ability to filter urls for a regular expression (keep in mind this is still Ruby&#8217;s regex, so <code>.*</code> to match any run of characters, not <code>*</code> as in file globbing, <code>(true|false)</code> for choice and so on.)</li>
<li>Downloads all the urls serially, or just outputs to screen (with <code>--dump</code>) if you want to filter/sort/etc.</li>
<li>Can use an existing index file (with <code>--useindex</code>), but then if there are relative links among the urls, they will need post-processing, because the path of the index page on the server is not known after it has been stored locally.</li>
<li>Uses <code>wget</code> internally and relays its output as well. Supports <code>http</code>, <code>https</code> and <code>ftp</code> urls.</li>
<li>Semantics consistent with <code>for url in urls; do wget $url</code>&#8230; does not re-download completed files, resumes downloads, retries interrupted transfers.</li>
</ul>
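<p>The post-processing needed for relative links with <code>--useindex</code> amounts to resolving them against the url the index page originally came from. A minimal python sketch, with a hypothetical base url and a made-up helper name:</p>

```python
from urllib.parse import urljoin


def absolutize(urls, base):
    """Resolve relative urls against the url the index page came from."""
    return [urljoin(base, url) for url in urls]


base = "http://www.example.com/media/index.html"  # hypothetical original location
for url in absolutize(["video/a.ogg", "/b.ogg", "http://other.example.com/c.ogg"], base):
    print(url)
```

<p>Urls that are already absolute pass through unchanged, so the whole list can be run through the same step.</p>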
<p><strong>Limitations</strong></p>
<ul>
<li>Not guaranteed to find every last url, although the matching is pretty lenient. If you can&#8217;t match a certain url you&#8217;re still stuck with <code>grep</code> and <code>sed</code>.</li>
<li>If you have to authenticate yourself somehow in the browser to be able to download your media files, <code>spiderfetch</code> won&#8217;t be able to download them (as with <code>wget</code> in general). However, all is not lost. If the urls are ftp or the web server uses simple authentication, you can still post-process them to: <code>ftp://username:password@the.rest.of.the.url</code>, same for http.</li>
</ul>
<p>Download <code>spiderfetch</code>:</p>
<ul>
<li><a href="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/spiderfetch.rb">spiderfetch.rb</a></li>
</ul>
<p><strong>Recipes</strong></p>
<p>To make the use a bit clearer, let&#8217;s see some concrete examples.</p>
<p><strong>Recipe:</strong> Download the 2008 lectures from Fosdem:</p>
<blockquote><p>spiderfetch.rb http://www.fosdem.org/2008/media/video 2008.*ogg</p></blockquote>
<p>Here we use the pattern <code>2008.*ogg</code>. If you first run <code>spiderfetch</code> with <code>--dump</code>, you&#8217;ll see that all the urls for the lectures in 2008 contain the string <code>2008</code>. Further, all the video files have the extension <code>ogg</code>. And whatever characters come in between those two things, we don&#8217;t care.</p>
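<p>If you want to try a pattern before pointing it at a real page, the filtering step is easy to reproduce. This sketch uses python&#8217;s <code>re</code> module, which behaves the same as Ruby&#8217;s regex for a pattern this simple; the urls are invented.</p>

```python
import re


def filter_urls(urls, pattern):
    """Keep the urls in which the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


# Invented urls, only to show what the pattern keeps and drops.
urls = [
    "http://example.org/2008/main/talk.ogg",
    "http://example.org/2007/main/talk.ogg",
    "http://example.org/2008/main/slides.pdf",
]
print(filter_urls(urls, r"2008.*ogg"))
```

<p>Only the first url survives: the second lacks <code>2008</code> and the third has no <code>ogg</code> after it.</p>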
<p><strong>Recipe:</strong> Download .asx =&gt; mms videos</p>
<p>Like it or not, sometimes you have to deal with ugly proprietary protocols. Video files exposed as <code>.asx</code> files are typically pointers to urls of the <code>mms://</code> protocol. Microsoft calls them <a href="http://www.microsoft.com/windows/windowsmedia/howto/articles/introwmmeta.aspx">metafiles</a>. This snippet illustrates how you can download them. First you spider for all the .asx urls, using the pattern <code>\.asx$</code>, which means &#8220;match on strings containing <code>.asx</code> as the last characters of the string&#8221;. Then we spider each of those urls for actual urls to video files, which begin with <code>mms</code>. And for each one we use <code>mplayer -dumpstream</code> to actually download the video.</p>
<pre class="bash"><span style="color: #808080; font-style: italic;">#!/bin/bash</span>
&nbsp;
<span style="color: #0000ff;">mypath=</span>$<span style="color: #66cc66;">&#40;</span><span style="color: #000066;">cd</span> $<span style="color: #66cc66;">&#40;</span>dirname $<span style="color: #cc66cc;">0</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #000066;">pwd</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #0000ff;">webpage=</span><span style="color: #ff0000;">"$1"</span>
&nbsp;
<span style="color: #b1b100;">for</span> url <span style="color: #b1b100;">in</span> $<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$mypath</span>/spiderfetch.rb <span style="color: #0000ff;">$webpage</span> <span style="color: #ff0000;">"<span style="color: #000099; font-weight: bold;">\\.</span>asx$"</span> --dump<span style="color: #66cc66;">&#41;</span>; <span style="color: #b1b100;">do</span>
	<span style="color: #0000ff;">video=</span>$<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$mypath</span>/spiderfetch.rb <span style="color: #0000ff;">$url</span> <span style="color: #ff0000;">"^mms"</span> --dump<span style="color: #66cc66;">&#41;</span>
	mplayer -dumpstream <span style="color: #0000ff;">$video</span> -dumpfile $<span style="color: #66cc66;">&#40;</span>basename <span style="color: #0000ff;">$video</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #b1b100;">done</span>
&nbsp;</pre>
<p><i>Download this code: </i><a href="http://www.matusiak.eu/numerodix/blog/wp-content/uploads/asx_spiderfetch.sh">asx_spiderfetch.sh</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.matusiak.eu/numerodix/blog/index.php/2008/04/26/download-all-media-links-on-a-webpage/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

