Reputation: 256
I've got an interesting little side project inspired by today's xkcd tooltip. The premise is that for any Wikipedia article, if you follow the first link (that is not inside parentheses or in italics) over and over, you will eventually reach the Philosophy article.
I am trying to write a program that picks a Wikipedia page at random (probably using the http://en.wikipedia.org/wiki/Special:Random URL) and then determines that page's "depth" from Philosophy.
I've knocked up a program in C (my most familiar language) just to get the plan straight and quickly realised I knew how to do most of it apart from two "minor" (aka the important bits) problems:
char *grab_first_link(const char *page, int n){
    //return url of 1st link not in italics or inside parentheses
}
char *get_random_page(void){
    //go to http://en.wikipedia.org/wiki/Special:Random
    //wait 2 seconds
    //return the URL generated by the random page
}
So basically I'm looking for a library that can help with handling simple HTML pages, plus some tips on how to pick the correct link based on the rules above.
(Also, I'm sure there are a million+1 ways to do this more efficiently or easily; I'm just curious whether I can get all or most of it done in C.)
Thanks for any help, tips, links or points in the right direction.
Upvotes: 0
Views: 223
Reputation: 943595
Upvotes: 1
Reputation: 363607
My advice for any program that works on Wikipedia: don't do this using the HTML; instead, parse the SQL dump, specifically the link table. A link table parser (in C++, not C) is available as part of my Wikiassoc program.
Upvotes: 1