user59634

Getting a large number (but not all) Wikipedia pages

For an NLP project of mine, I want to download a large number of pages (say, 10,000) from Wikipedia at random. Without downloading the entire XML dump, this is what I can think of:

  1. Open a Wikipedia page
  2. Parse the HTML for links in a breadth-first-search fashion and open each page
  3. Recursively open the links on the pages obtained in step 2

In steps 2 and 3, I will quit once I have reached the number of pages I want (a rough sketch of this crawl is below).

How would you do it? Please suggest any better ideas you can think of.
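
For reference, here is a minimal sketch of the crawl I have in mind (Python 2; the regex link extraction and the seed page are just placeholders to illustrate the idea, not a robust crawler):

# Minimal BFS crawl sketch (Python 2). The regex link extraction is naive;
# a real crawler should use an HTML parser and respect rate limits.
import re
import urllib2
from collections import deque

LINK_RE = re.compile(r'href="(/wiki/[^":#?]+)"')

def bfs_crawl(seed_path, limit):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    seen, queue, pages = set([seed_path]), deque([seed_path]), []
    while queue and len(pages) < limit:
        path = queue.popleft()
        html = opener.open('http://en.wikipedia.org' + path).read()
        pages.append((path, html))
        for link in LINK_RE.findall(html):
            if link not in seen:  # enqueue each article at most once
                seen.add(link)
                queue.append(link)
    return pages

pages = bfs_crawl('/wiki/Natural_language_processing', 10)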

ANSWER: This is my Python code:

# Fetch random pages from Wikipedia via Special:Random.
import urllib2
import os
import shutil

# Recreate the directory that stores the HTML pages.
if os.path.exists('randompages'):
    print "Deleting the old randompages directory"
    shutil.rmtree('randompages')

print "Creating the directory for storing the pages"
os.mkdir('randompages')

num_pages = raw_input('Number of pages to retrieve: ')

# Build the opener once; Wikipedia may reject urllib2's default User-Agent.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

for i in range(int(num_pages)):
    page = opener.open('http://en.wikipedia.org/wiki/Special:Random').read()

    # Write it to a file.
    # TODO: Strip HTML from page
    f = open('randompages/file' + str(i) + '.html', 'w')
    f.write(page)
    f.close()

    print "Retrieved and saved page", i + 1

Upvotes: 4

Views: 3009

Answers (3)

Michael Dorfman

Reputation: 4100

I'd go the opposite way: start with the XML dump, and then throw away what you don't want.

In your case, if you are looking to do natural language processing, I would assume you are interested in pages with complete sentences, not lists of links. If you spider the links in the manner you describe, you'll hit a lot of link pages.

And why avoid the XML, when you gain the benefit of XML parsing tools that will make your selection process easier?
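
For example, a rough sketch of streaming pages out of a local dump (Python 2; the dump filename and the export namespace URI here are assumptions, since the URI varies with the dump's schema version, so check the file's root element first):

# Sketch: stream pages out of an enwiki dump without loading it all.
# The namespace URI below depends on the dump's export schema version.
import xml.etree.cElementTree as etree

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

def iter_articles(dump_path, limit):
    kept = 0
    for event, elem in etree.iterparse(dump_path):
        if elem.tag == NS + 'page':
            title = elem.findtext(NS + 'title')
            text = elem.findtext(NS + 'revision/' + NS + 'text')
            # Keep prose pages; skip redirects (and, in practice, list pages).
            if text and not text.startswith('#REDIRECT'):
                yield title, text
                kept += 1
                if kept >= limit:
                    return
            elem.clear()  # keep memory flat while streaming

for title, text in iter_articles('enwiki-latest-pages-articles.xml', 10000):
    pass  # hand (title, text) to the NLP pipeline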

Upvotes: 1

Pierre

Reputation: 35256

Wikipedia has an API. With this API you can get random articles in a given namespace:

http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=5

and for each article you can also get the wiki text:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content
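
Putting the two calls together in Python 2 (a sketch: format=json and the response layouts shown are assumptions about the current API output, so check the API documentation):

# Sketch: fetch random article titles via the API, then pull each page's
# wiki text. Response layouts below are assumptions; verify against the docs.
import json
import urllib
import urllib2

API = 'http://en.wikipedia.org/w/api.php'

def api_call(params):
    params['format'] = 'json'
    req = urllib2.Request(API + '?' + urllib.urlencode(params),
                          headers={'User-agent': 'Mozilla/5.0'})
    return json.load(urllib2.urlopen(req))

# Get 5 random article titles (namespace 0).
data = api_call({'action': 'query', 'list': 'random',
                 'rnnamespace': 0, 'rnlimit': 5})
titles = [r['title'] for r in data['query']['random']]

# Fetch the wiki text of each title.
for title in titles:
    data = api_call({'action': 'query', 'prop': 'revisions',
                     'titles': title.encode('utf-8'), 'rvprop': 'content'})
    for page in data['query']['pages'].values():
        text = page['revisions'][0]['*']  # raw wiki markup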

Upvotes: 20

Tommy Carlier

Reputation: 8149

for i = 1 to 10000
    get "http://en.wikipedia.org/wiki/Special:Random"

Upvotes: 24
