download all media links on a webpage

April 26th, 2008

This has probably happened to you. You come to a web page with links to a bunch of pictures, or videos, or documents that you want to download. Not one or two, but all of them. How do you go about it? Personally, I use wget for anything that will take a while to download. It's wonderful: it handles http, https, ftp and so on, it has options to resume and retry, it never fails. I could just use Firefox, and if it's small files then I do just that: click all the links in one fell swoop and let them download on their own. But with larger files that's not practical. You don't want to download 20 videos of 200MB each in parallel, that's no good. If Firefox crashes within the next few hours (which it probably will), you'll likely end up with not even one file successfully downloaded. And Firefox doesn't have a resume function (there is a button, but it doesn't do anything).

So there is a fallback option: copy all the links from Firefox and queue them up for wget: right click in the document, Copy Link Location, right click in the terminal window. This is painful and I last about 4-5 links before I get sick of it, download the web page and start parsing it instead. That always works, but I have to rig up a new chain of grep, sed, tr and xargs wget (or a for loop) for every page; I can never reuse it, so the effort doesn't go a long way.
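The throwaway chain tends to look something like this (a sketch: the page contents and the .ogg pattern are made up for illustration, and the page is written inline rather than downloaded):

```shell
# a typical one-off chain: save the page, pull out the links you want,
# then queue them for wget. index.html stands in for the downloaded page.
cat > index.html <<'EOF'
<a href="http://example.com/talk1.ogg">talk 1</a>
<a href="http://example.com/talk2.ogg">talk 2</a>
EOF

# grep out the hrefs we care about, strip the html around them
grep -o 'href="[^"]*\.ogg"' index.html \
    | sed -e 's/^href="//' -e 's/"$//' \
    > urls.txt

cat urls.txt
# then feed the queue to wget, one file at a time:
#   xargs -n 1 wget -c < urls.txt
```

And of course the grep pattern is different for every page, which is exactly why the chain never gets reused.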

There is another option. I could use a Firefox extension for this; there are several for this purpose. But that too is fraught with pain. Some of them don't work, some only work for certain types of files, some still require manual effort to pick the right urls, and some don't support resuming a download after Firefox crashes. Not to mention that every new extension slows down Firefox and adds another upgrade cycle you have to worry about. Want to run Firefox 3? Oh sorry, your download extension isn't compatible. wget, in contrast, never stops working. Most limiting of all, these extensions aren't Unix-y. They assume they know what you want, and they take you from start to end. There's no way to plug grep into the chain to filter out things you don't want, for example.

So the problem is eventually reduced to: how can I still use wget? Well, browsers being as lenient as they are, it's difficult to guarantee that you can parse every page, but you can at least try. spiderfetch, whose name describes its function: spider a page for links and then fetch them, attacks the common scenario. You find a page that links to a bunch of media files. So you feed the url to spiderfetch. It will download the page and find all the links (as best it can). It will then download the files one by one. Internally, it uses wget, so you still get the desired functionality and the familiar output.
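The spidering step itself boils down to matching anything url-shaped in the page text. A minimal sketch of the idea (not spiderfetch's actual code; an inline string stands in for the downloaded page):

```shell
# match anything that looks like a url in the page source:
# a known scheme, then everything up to a quote, bracket or whitespace
page='<a href="http://example.com/a.ogg">a</a> see also mms://host/stream.wmv'
echo "$page" | grep -Eo '(https?|ftp|mms)://[^"<> ]+'
```

Note that this picks up the mms:// url too, even though it sits in plain text rather than in an href, which is what "as best it can" buys you.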

If the urls on the page require additional post-processing, say they are .asx files you have to download one by one, grab the mms:// url inside, and feed to mplayer -dumpstream, you at least get the first half of the chain. (Unlikely scenario? If you wanted to download these freely available lectures on compilers from the University of Washington, you have little choice. You could even chain spiderfetch to do both: first spider the index page, download all the .asx files, then spider each .asx file for the mms:// url, print it to the screen and let mplayer take it from there. No more grep or sed.)

Features

  • Spiders the page for anything that looks like a url.
  • Ability to filter urls with a regular expression (keep in mind this is Ruby's regex syntax: .* matches any sequence of characters, not * as in file globbing, (true|false) expresses alternation, and so on.)
  • Downloads all the urls serially, or just outputs to screen (with --dump) if you want to filter/sort/etc.
  • Can use an existing index file (with --useindex), but then if there are relative links among the urls, they will need post-processing, because the path of the index page on the server is not known after it has been stored locally.
  • Uses wget internally and relays its output as well. Supports http, https and ftp urls.
  • Semantics consistent with for url in urls; do wget $url; done: does not re-download completed files, resumes partial downloads, retries interrupted transfers.
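That last point can be pictured as the loop spiderfetch effectively runs for you (a sketch with echo standing in for a live run; -c resumes a partial file and -t caps retries, though the exact wget invocation may differ):

```shell
# the effective per-url download loop, with echo in place of a real fetch
urls="http://example.com/a.ogg http://example.com/b.ogg"
for url in $urls; do
    echo wget -c -t 3 "$url"
done
```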

Limitations

  • Not guaranteed to find every last url, although the matching is pretty lenient. If you can't match a certain url you're still stuck with grep and sed.
  • If you have to authenticate yourself somehow in the browser to be able to download your media files, spiderfetch won't be able to download them (as with wget in general). However, all is not lost. If the urls are ftp or the web server uses simple authentication, you can still post-process them to: ftp://username:password@the.rest.of.the.url, same for http.
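Splicing credentials into the dumped urls is a one-liner with sed (alice and secret are made-up credentials here):

```shell
# rewrite an ftp url to carry simple-auth credentials
# (alice/secret are hypothetical; the same substitution works for http://)
url="ftp://ftp.example.com/pub/file.ogg"
echo "$url" | sed 's|^ftp://|ftp://alice:secret@|'
```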

Download spiderfetch:

Recipes

To make the usage a bit clearer, let's see some concrete examples.

Recipe: Download the 2008 lectures from Fosdem:

spiderfetch.rb http://www.fosdem.org/2008/media/video 2008.*ogg

Here we use the pattern 2008.*ogg. If you first run spiderfetch with --dump, you'll see that all the urls for the lectures in 2008 contain the string 2008. Further, all the video files have the extension ogg. And whatever characters come in between those two things, we don't care.
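You can check what a pattern will and won't match with grep -E, which treats 2008.*ogg the same way Ruby does (the urls below are made up in the style of the fosdem page):

```shell
# which of these survive the 2008.*ogg filter? only the first:
# it's the only url with "2008" followed (eventually) by "ogg"
printf '%s\n' \
    http://www.fosdem.org/2008/media/video/talk.ogg \
    http://www.fosdem.org/2007/media/video/old.ogg \
    http://www.fosdem.org/2008/media/slides.pdf \
    | grep -E '2008.*ogg'
```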

Recipe: Download .asx => mms videos

Like it or not, sometimes you have to deal with ugly proprietary protocols. Video files exposed as .asx files are typically pointers to urls of the mms:// protocol. Microsoft calls them metafiles. This snippet illustrates how you can download them. First we spider for all the .asx urls, using the pattern \.asx$, which matches ".asx at the very end of the string". Then we spider each of those urls for actual urls to video files, which begin with mms. And for each one we use mplayer -dumpstream to actually download the video.

#!/bin/bash

# find the directory this script lives in, so spiderfetch.rb can be
# invoked from anywhere
mypath=$(cd "$(dirname "$0")"; pwd)
webpage="$1"

# spider the page for .asx urls, then spider each .asx for its mms:// url
for url in $("$mypath/spiderfetch.rb" "$webpage" "\.asx$" --dump); do
	video=$("$mypath/spiderfetch.rb" "$url" "^mms" --dump)
	mplayer -dumpstream "$video" -dumpfile "$(basename "$video")"
done


2 Responses to "download all media links on a webpage"


  2. leu says:

    Hi! I am a IT idiot.

    How can I download the 28 mms videos from the following site quickly?

    http://sprott.physics.wisc.edu/wop.htm

    Thank you very much!