Hick

Reputation: 36394

How to parse a wikipedia page in Python?

I've been trying to parse a wikipedia page in Python and have been quite successful using the API.

But, somehow the API documentation seems a bit too skeletal for me to get all the data. As of now, I'm doing a requests.get() call to

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=China&format=json&exintro=1

But, this only returns me the first paragraph. Not the entire page. I've tried to use allpages and search but to no avail. A better explanation of how to get the data from a wiki page would be of real help. All the data and not just the introduction as returned by the previous query.
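For comparison, the query from the question can be built with a parameter dict; as far as I can tell from the API docs, it is the `exintro=1` parameter that restricts the extract to the introduction, and `explaintext=1` asks for plain text instead of HTML. A sketch (parameter behavior per the extracts prop docs, not verified against the live API here):

```python
from urllib.parse import urlencode

# Same endpoint as in the question; dropping "exintro" asks the
# extracts prop for more than the first section, and "explaintext=1"
# requests plain text rather than HTML.
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "China",
    "format": "json",
    "explaintext": 1,
}
url = "http://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```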

Upvotes: 1

Views: 4162

Answers (3)

Peter Girnus

Reputation: 2729

If someone is looking for a Python 3 answer, here you go:

import urllib.request

req = urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
print(req.read())

I'm using Python version 3.7.0b4.
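Note that in Python 3, `req.read()` returns bytes; to do anything useful with the JSON you need to decode and parse it. A minimal sketch, using a made-up stand-in for the response so it runs offline (the real response is much larger):

```python
import json

# Stand-in for req.read(); the shape mirrors an action=parse response,
# but the content here is fabricated for illustration.
raw = b'{"parse": {"title": "China", "text": {"*": "<p>...</p>"}}}'

data = json.loads(raw.decode("utf-8"))
print(data["parse"]["title"])
```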

Upvotes: 0

Senthil Kumaran

Reputation: 56823

You seem to be using the query action to get the content of the page. According to its API specs, it returns only a part of the data. The proper action seems to be parse.

Here is a sample:

import urllib2
req = urllib2.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
content = req.read()
# content in json - use json or simplejson to get relevant sections.
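Expanding on that last comment: once the JSON is loaded, the rendered page HTML sits under `parse` → `text` → `"*"`. A sketch with a canned response (not real API output) standing in for `content`:

```python
import json

# Canned response showing where the page HTML lives in action=parse
# output; a real response would contain the full rendered page.
content = '{"parse": {"title": "China", "text": {"*": "<p>China is a country.</p>"}}}'

data = json.loads(content)
page_html = data["parse"]["text"]["*"]
print(page_html)
```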

Upvotes: 3

carboncrank

Reputation: 71

Have you considered using Beautiful Soup to extract the content from the page?

While I haven't used it for Wikipedia specifically, others have, and having used it to scrape other pages I can say it is an excellent tool.
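For instance, given the HTML that action=parse returns, Beautiful Soup can strip the markup down to plain text. A sketch with a small fabricated fragment standing in for the real page HTML (requires the `beautifulsoup4` package):

```python
from bs4 import BeautifulSoup

# A small HTML fragment standing in for the "text" field of an
# action=parse response (fabricated for illustration).
html = '<div><p>China is a country in <a href="/wiki/East_Asia">East Asia</a>.</p></div>'

soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
```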

Upvotes: 1
