Building a web-crawler for fun and … the hell of it

So I’m reading Programming Collective Intelligence, by Toby Segaran, at the moment.  It’s about things like search engines, recommendation engines, filtering and sorting results, data mining and Web 2.0 social interactions in general.  This is interesting in (at least!) three ways:

  1. The examples are in Python, which is new to me and outside my comfort zone (I’m writing this as I wait for Xcode to download, so that I can get gcc, which I apparently need in order to build pysqlite).
  2. It’s making me dust off my maths/stats from way back, which is painful but probably ultimately healthy.
  3. C’mon; teaching computers how to be smart?  It just is, ok?  Figuring out why people like things is both interesting and incredibly sellable; developing insight via some automated process is clearly a hugely useful tool!

I’ll be interviewing at a place that cites “experience building web-crawlers” in its job spec, so I thought I’d have a go while I’m recuperating at home.  Chapter 4 covers this.  It turns out that you can write a web-crawler that starts from a page you give it, pulls that page down, parses it (BeautifulSoup appears to provide jQuery-selection-engine-like functionality), pulls out the links, and follows them to subsequent pages in … ~30 lines of code (leveraging two libraries).  This is … cool.  I think the same kind of thing in C# would have needed … rather more work!  I wonder whether I can get IronPython to play…?
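To give a feel for the shape of that loop, here is a minimal sketch in Python 3.  It is not the book’s code: it swaps BeautifulSoup for the standard library’s html.parser so that it runs without installing anything, and the names LinkParser, fetch, extract_links and crawl, plus the max_pages limit, are all made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag (a stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(base_url, html):
    """Return the absolute http(s) links found in html, fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl from start_url; returns the pages fetched, in order."""
    seen = {start_url}          # every URL ever queued, so nothing is queued twice
    queue = deque([start_url])  # URLs waiting to be fetched
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling crawl("https://example.com/", max_pages=5) would fetch at most five pages, starting from that URL.  A crawler you would actually let loose on the web also needs to respect robots.txt and pause between requests, which this sketch leaves out.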

Anyhow.  Xcode is down now, and gcc with it, so I’m off back to play…  I’ll review the book once I’ve finished it.
