Web crawler type program - wiki degrees of separation

Question

I've got an interesting little side project inspired by today's xkcd tooltip. The premise is that for any Wikipedia article, if you follow the first link (that is not inside brackets or in italics) over and over, you will eventually get to the Philosophy article.

I am trying to write a program that chooses a Wikipedia page at random (probably using the http://en.wikipedia.org/wiki/Special:Random URL) and then determines that page's "depth" from Philosophy.

I've knocked up a program in C (my most familiar language) just to get the plan straight and quickly realised I knew how to do most of it apart from two "minor" (aka the important bits) problems:

char *grab_first_link(const char *page, int n) {
    // return URL of the 1st link not in italics or inside parentheses
    return NULL; // not implemented yet
}

char *get_random_page(void) {
    // go to http://en.wikipedia.org/wiki/Special:Random
    // wait 2 seconds
    // return the URL generated by the random page
    return NULL; // not implemented yet
}

So basically I'm looking for a library that can help with fetching and parsing simple HTML pages, and some tips on how to get the correct link based on the aforementioned rules.

(Also, I'm sure there are a million+1 ways to do this more efficiently / easier; I'm just curious whether I can get it all/mostly done in C.)

Thanks for any help, tips, links or points in the right direction.

Quentin · Accepted Answer

1. Find an HTML parser library (libxml2 might do the job) and read its manual. XPath will probably be your friend for this.
2. Find an HTTP client library (and read its manual), then see 1.
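
While you are choosing libraries, it may help to pin down the link rule itself on a plain string. The following is only a rough sketch: it is a naive character scan using the C standard library, not the parser/XPath approach suggested above, and it would break on real-world markup quirks (uppercase tags, `<em>`, single-quoted attributes, nested tables and so on). The function name `first_plain_link` and its signature are made up for illustration. It assumes italics are marked with `<i>` and that links look like `<a href="...">`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the href of the first <a> that is neither
 * inside an <i> element nor inside parentheses in the visible text.
 * Returns 1 on success, 0 if no such link exists or it does not fit. */
int first_plain_link(const char *html, char *out, size_t outsz)
{
    int paren = 0;   /* depth of '(' ... ')' seen in text outside tags */
    int italic = 0;  /* depth of currently open <i> elements */
    const char *p = html;

    assert(html != NULL && out != NULL && outsz > 0);

    while (*p) {
        if (*p == '<') {
            const char *end = strchr(p, '>');
            if (end == NULL)
                break;                       /* unterminated tag: give up */

            if (strncmp(p, "</i>", 4) == 0) {
                if (italic > 0)
                    italic--;
            } else if (strncmp(p, "<i>", 3) == 0 || strncmp(p, "<i ", 3) == 0) {
                italic++;
            } else if (strncmp(p, "<a ", 3) == 0 && paren == 0 && italic == 0) {
                /* look for href="..." inside this tag only */
                const char *h = strstr(p, "href=\"");
                if (h != NULL && h < end) {
                    const char *q;
                    h += 6;                  /* skip past href=" */
                    q = strchr(h, '"');
                    if (q != NULL && q < end) {
                        size_t len = (size_t)(q - h);
                        if (len >= outsz)
                            return 0;        /* caller's buffer too small */
                        memcpy(out, h, len);
                        out[len] = '\0';
                        return 1;
                    }
                }
            }
            p = end + 1;                     /* skip the whole tag */
        } else {
            /* parentheses only count in text, not inside tag attributes */
            if (*p == '(')
                paren++;
            else if (*p == ')' && paren > 0)
                paren--;
            p++;
        }
    }
    return 0;
}
```

You would call it with the page body and a buffer, e.g. `char url[256]; if (first_plain_link(body, url, sizeof url)) { ... }`. Because whole tags are skipped, parentheses inside an href such as `/wiki/C_(x)` do not affect the depth count. For anything beyond experimenting, a real HTML parser is the safer route.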
