Reputation:
For an NLP project of mine, I want to download a large number of pages (say, 10000) at random from Wikipedia. Without downloading the entire XML dump, this is what I can think of:
In steps 2 and 3, I will quit if I have reached the number of pages I want.
How would you do it? Please suggest any better ideas you can think of.
ANSWER: This is my Python code:
# Get a number of random pages from Wikipedia.
import urllib2
import os
import shutil

# Make a fresh directory to store the HTML pages.
if os.path.exists('randompages'):
    print "Deleting the old randompages directory"
    shutil.rmtree('randompages')
os.mkdir('randompages')
print "Created the directory for storing the pages"

num_page = raw_input('Number of pages to retrieve:: ')

# Build the opener once, outside the loop, with a browser-like User-Agent.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

for i in range(int(num_page)):
    infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
    page = infile.read()
    # Write it to a file.
    # TODO: Strip HTML from page
    f = open('randompages/file' + str(i) + '.html', 'w')
    f.write(page)
    f.close()
    print "Retrieved and saved page", i + 1
Upvotes: 4
Views: 3009
Reputation: 4100
I'd go the opposite way: start with the XML dump, and then throw away what you don't want.
In your case, if you are looking to do natural language processing, I would assume that you are interested in pages that have complete sentences, and not lists of links. If you spider the links in the manner you describe, you'll be hitting a lot of link pages.
And why avoid the XML, when you get the benefit of using XML parsing tools that will make your selection process easier?
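If you do start from the dump, a streaming parse keeps memory use manageable. A rough sketch (the file name is a placeholder for a decompressed pages-articles dump, and tags are matched by suffix because the dump XML carries a version-specific namespace):
# Sketch: stream over a decompressed dump and keep the wiki text of the
# first N pages. The path and the count of 10000 are placeholders.
import xml.etree.cElementTree as etree

def iter_page_texts(dump_path):
    # Yield the <text> content of each <page>, clearing parsed pages as we go
    # so the whole dump never has to sit in memory.
    for event, elem in etree.iterparse(dump_path, events=('end',)):
        if elem.tag.rsplit('}', 1)[-1] == 'page':
            for child in elem.iter():
                if child.tag.rsplit('}', 1)[-1] == 'text' and child.text:
                    yield child.text
            elem.clear()

count = 0
for wikitext in iter_page_texts('enwiki-pages-articles.xml'):
    count += 1
    if count >= 10000:
        break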
Upvotes: 1
Reputation: 35256
Wikipedia has an API. With it, you can get random articles in a given namespace:
http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=5
and for each article you can also get the wiki text:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content
Upvotes: 20
Reputation: 8149
import urllib2

for i in range(10000):
    urllib2.urlopen('http://en.wikipedia.org/wiki/Special:Random').read()
Upvotes: 24