roopalgarg

Reputation: 449

Wikipedia Scraper using Python

Hi, I am trying to build a simple Wikipedia scraping tool in Python that lets me analyse the text and build a timeline of the events in a person's life. I have been searching the net for possible ways to do this, and so far I have been able to retrieve the data using BeautifulSoup and urllib2. The code so far looks something like this:

from bs4 import BeautifulSoup
import urllib2
import re
import nltk
import json


#get source code of page (function used later)
def fetchsource(url):
    source = urllib2.urlopen(url).read()
    return source

if __name__=='__main__':
    #url = "http://en.wikipedia.org/w/index.php?action=raw&title=Tom_Cruise" #works
    url="http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=Tom_Cruise" #works
    print url
    source = fetchsource(url)
    soup = BeautifulSoup(source)
    print soup.prettify()

Now, although I can work with this, the output I get is a little tricky to parse, and I just wanted to ask if there is a better way to do it, or maybe a more manageable format in which I can retrieve the data. Kindly comment.
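For reference, swapping format=xml for format=json in the URL above gives a response I can load with the json module I already import. A minimal sketch of pulling the wikitext out of that structure (the nested keys and the sample response below are my reading of the MediaWiki API docs, so verify them against a live call):

```python
import json

# Sketch: the shape below is my reading of a MediaWiki API response for
# action=query&prop=revisions&rvprop=content&format=json -- check it
# against a live call before relying on it.
sample_response = json.loads("""{
    "query": {
        "pages": {
            "31460": {
                "pageid": 31460,
                "title": "Tom Cruise",
                "revisions": [{"*": "'''Thomas Cruise Mapother IV'''"}]
            }
        }
    }
}""")

def extract_wikitext(api_response):
    # "pages" is keyed by page id, so take the first (only) entry
    pages = api_response["query"]["pages"]
    page = next(iter(pages.values()))
    return page["revisions"][0]["*"]

print(extract_wikitext(sample_response))
```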

Upvotes: 3

Views: 3818

Answers (3)

alchimista

Reputation: 196

You can also use pywikipediabot to get the article wikitext. For example, to get the wikitext of Tom Cruise, as in your example, you can use:

import wikipedia

page = wikipedia.Page(wikipedia.getSite(), 'Tom_Cruise')
pageText = page.get()
print pageText

This way you can try to get the data from the templates, and there are parsers for wikitext if you need them.
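As a rough illustration of pulling one value out of a template (a naive regex sketch, not pywikipediabot itself; the infobox fragment and parameter names are made up, and nested templates or pipes in the value will defeat this, which is where a real wikitext parser comes in):

```python
import re

def get_template_param(wikitext, param):
    # Naive: grab the text after "| param =" up to the next pipe or
    # newline. Values containing nested templates need a real parser.
    match = re.search(r'\|\s*' + re.escape(param) + r'\s*=\s*([^\n|]+)',
                      wikitext)
    return match.group(1).strip() if match else None

# hypothetical infobox fragment
sample = "{{Infobox person\n| name = Tom Cruise\n| occupation = Actor\n}}"
print(get_template_param(sample, 'name'))        # -> Tom Cruise
print(get_template_param(sample, 'occupation'))  # -> Actor
```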

Upvotes: 6

Neodawn

Reputation: 1096

DBpedia exposes the structured information in Wikipedia so it can be retrieved through a SPARQL query. http://dbpedia.org/
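To give a feel for what such a query looks like (a sketch only: the property names dbo:birthDate / dbo:birthPlace and the dbr:/dbo: prefixes follow DBpedia's documented ontology, but treat them as assumptions to double-check against the site):

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

# Sketch: property and prefix names follow DBpedia's documented
# ontology; verify them at http://dbpedia.org/ before depending on them.
query = """
SELECT ?birthDate ?birthPlace WHERE {
    dbr:Tom_Cruise dbo:birthDate ?birthDate ;
                   dbo:birthPlace ?birthPlace .
}
"""

endpoint = "http://dbpedia.org/sparql"
request_url = endpoint + "?" + urlencode({
    "query": query,
    "format": "application/sparql-results+json",
})
print(request_url)
```

Fetching request_url (e.g. with urllib2.urlopen, as in the question) would return the results as JSON.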

Upvotes: 2

Achim

Reputation: 15692

Extracting data from HTML pages is never fun, but http://scrapy.org/ makes it much easier in my opinion. You can use XPath to extract data, which is quite powerful. If you want to retrieve the data that way, I would definitely use Scrapy.

You should also check whether there are other options for getting the data. As far as I know, it is possible to download a full data dump of Wikipedia. That might be overkill for your use case, but other APIs might exist.
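To give a feel for XPath extraction, here is a minimal sketch using the standard library's limited XPath support on a made-up, well-formed snippet. Real Wikipedia HTML is rarely well-formed XML, which is exactly where Scrapy (or lxml) earns its keep:

```python
import xml.etree.ElementTree as ET

# hypothetical, well-formed fragment of an infobox table
snippet = (
    "<html><body>"
    "<table class='infobox'>"
    "<tr><th>Born</th><td>July 3, 1962</td></tr>"
    "</table>"
    "</body></html>"
)

root = ET.fromstring(snippet)
# ElementTree supports only a subset of XPath; Scrapy/lxml support more
born = root.find(".//table[@class='infobox']/tr/td").text
print(born)  # -> July 3, 1962
```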

Upvotes: 2
