Reputation: 449
Hi, I am trying to build a simple Wikipedia scraping tool in Python that lets me analyse the text and build a timeline of the events in a person's life. I have been searching the net for possible ways to do this, and so far I have been able to retrieve the data using BeautifulSoup and urllib2. The code so far looks something like this:
from bs4 import BeautifulSoup
import urllib2
import re
import nltk
import json
#get source code of page (function used later)
def fetchsource(url):
    source = urllib2.urlopen(url).read()
    return source

if __name__ == '__main__':
    #url = "http://en.wikipedia.org/w/index.php?action=raw&title=Tom_Cruise" #works
    url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&&titles=Tom_Cruise" #works
    print url
    source = fetchsource(url)
    soup = BeautifulSoup(source)
    print soup.prettify()
I can work with this, but the output I get is a little tricky to parse. Is there a better way to do this, or a more manageable format in which I can retrieve the data? Kindly comment.
Upvotes: 3
Views: 3818
Reputation: 196
You can also use pywikipediabot to get the article wikitext. For example, to get the wikitext of Tom Cruise, like in your example, you can use:
import wikipedia
page = wikipedia.Page(wikipedia.getSite(), 'Tom_Cruise')
pageText = page.get()
print pageText
This way you can try to get the data from templates, and there are some parsers for wikitext, if needed.
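To give an idea of what "getting the data from templates" can look like, here is a rough sketch in Python 3 that pulls `| key = value` parameters out of a piece of wikitext with a regular expression. The sample text, `FIELD_RE` and `extract_fields` are made up for illustration. It assumes one parameter per line and is not a real wikitext parser, so nested templates and multi-line values will not be handled.

```python
import re

# Hypothetical sample; real infoboxes nest templates and links much more deeply.
SAMPLE_WIKITEXT = """{{Infobox person
| name = Tom Cruise
| birth_name = Thomas Cruise Mapother IV
| occupation = Actor, producer
}}
'''Tom Cruise''' is an American actor."""

# Matches lines of the form "| key = value" (an assumption about the layout).
FIELD_RE = re.compile(
    r"^[ \t]*\|[ \t]*([\w ]+?)[ \t]*=[ \t]*(.*?)[ \t]*$",
    re.MULTILINE,
)


def extract_fields(wikitext):
    """Return a dict of template parameters written one per line."""
    return {key: value for key, value in FIELD_RE.findall(wikitext)}


fields = extract_fields(SAMPLE_WIKITEXT)
print(fields["birth_name"])
```

For anything beyond simple infobox fields, a dedicated wikitext parser is likely to be more reliable than a regex like this.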
Upvotes: 6
Reputation: 1096
DBpedia allows structured information in Wikipedia to be retrieved through a query. http://dbpedia.org/
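As a minimal sketch of what such a query might look like, the Python 3 snippet below builds a URL for DBpedia's public SPARQL endpoint that asks for a person's birth date. The helper name `build_query_url` is hypothetical, and the choice of the `dbo:birthDate` property and the `format` parameter are assumptions about what the endpoint accepts, so check them against the DBpedia documentation before relying on them. The snippet only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"


def build_query_url(resource_name):
    """Build a GET URL asking DBpedia for the birth date of a resource."""
    query = (
        "PREFIX dbo: <http://dbpedia.org/ontology/> "
        "SELECT ?birthDate WHERE { "
        "<http://dbpedia.org/resource/%s> dbo:birthDate ?birthDate }"
        % resource_name
    )
    # Assumption: the endpoint returns JSON when asked for this format.
    params = {"query": query, "format": "application/sparql-results+json"}
    return SPARQL_ENDPOINT + "?" + urlencode(params)


url = build_query_url("Tom_Cruise")
print(url)
```

The resulting URL can be fetched the same way as the Wikipedia API URL in the question, and the response parsed with the `json` module.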
Upvotes: 2
Reputation: 15692
Extracting data from HTML pages is never fun, but http://scrapy.org/ makes it much easier in my opinion. You can use XPath to extract data, which is quite powerful. If you want to retrieve the data that way, I would definitely use Scrapy.
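To show the idea of XPath-style extraction without installing anything, here is a small Python 3 sketch that uses the standard library's `xml.etree.ElementTree` instead of Scrapy (it supports only a subset of XPath). The sample document is a simplified guess at the shape of the `format=xml` API response from the question, and `extract_wikitext` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed shape of the API's XML response; the real one has
# more attributes and a much longer <rev> body.
SAMPLE_RESPONSE = """<api>
  <query>
    <pages>
      <page pageid="1" title="Tom Cruise">
        <revisions>
          <rev>'''Tom Cruise''' (born July 3, 1962) is an American actor.</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>"""


def extract_wikitext(xml_text):
    """Return the text of every <rev> element found under a <page>."""
    root = ET.fromstring(xml_text)
    return [rev.text for rev in root.findall(".//page/revisions/rev")]


texts = extract_wikitext(SAMPLE_RESPONSE)
print(texts[0])
```

With Scrapy the path expression would be much the same, but its selectors support full XPath and cope with messy real-world HTML.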
You should also check whether there are other options for getting the data. As far as I know, it's possible to download a data dump of Wikipedia. That might be overkill for your use case, but other APIs might exist.
Upvotes: 2