Reputation: 36394
I've been trying to parse a Wikipedia page in Python and have been quite successful using the API.
But somehow the API documentation seems a bit too skeletal for me to get all the data. As of now, I'm doing a requests.get() call to
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=China&format=json&exintro=1
But this only returns the first paragraph, not the entire page. I've tried to use allpages and search, but to no avail. A better explanation of how to get all the data from a wiki page, not just the introduction returned by the query above, would be of real help.
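For reference, here is roughly the call I'm making (a minimal sketch using the requests library, with the query parameters from the URL above spelled out):

import requests

# Same request as the URL above, written out as parameters
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "China",
    "format": "json",
    "exintro": 1,
}
resp = requests.get("http://en.wikipedia.org/w/api.php", params=params)
print(resp.json())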
Upvotes: 1
Views: 4162
Reputation: 2729
If someone is looking for a Python 3 answer, here you go:
import urllib.request

req = urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
print(req.read().decode("utf-8"))  # decode the raw bytes into the JSON text
I'm using Python version 3.7.0b4.
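The response is JSON, so if you only want the rendered page HTML out of it, here is a small sketch (assuming the standard json module and the API's default response layout, where the HTML sits under parse -> text -> *):

import json
import urllib.request

with urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text") as resp:
    data = json.load(resp)

# The rendered HTML of the page lives under parse -> text -> *
print(data["parse"]["text"]["*"])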
Upvotes: 0
Reputation: 56823
You seem to be using the query action to get the content of the page. According to its API specs, it returns only a part of the data. The proper action seems to be parse.
Here is a sample
import urllib2
import json

req = urllib2.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
content = json.loads(req.read())
# content is now a dict - the relevant section (the page HTML) is under content["parse"]["text"]["*"]
Upvotes: 3
Reputation: 71
Have you considered using Beautiful Soup to extract the content from the page?
While I haven't used it for Wikipedia, others have, and having used it to scrape other pages, I can say it is an excellent tool.
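A minimal sketch of that approach (assuming the bs4 package and the requests library mentioned in the question; the tag selection is just illustrative):

import requests
from bs4 import BeautifulSoup

# Fetch the rendered article and pull the text out of its paragraph tags
resp = requests.get("http://en.wikipedia.org/wiki/China")
soup = BeautifulSoup(resp.text, "html.parser")
for p in soup.find_all("p"):
    print(p.get_text())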
Upvotes: 1