Reputation: 345
I want to extract the information in the Infobox from specific Wikipedia pages, mainly countries. Specifically I want to achieve this without scraping the page using Python + BeautifulSoup4 or any other languages + libraries, if possible. I'd rather use the official API, because I noticed the CSS tags are different for different Wikipedia subdomains (as in other languages).
The question "How to get Infobox from a Wikipedia article by Mediawiki API?" states that the following method works. That is indeed true for the given title (Scary Monsters and Nice Sprites), but unfortunately it doesn't work on the pages I tried (further below).
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0
However, I suppose Wikimedia changed their infobox template, because when I run the above query all I get is the content, but not the infobox. E.g. running the query on Europäische_Union (European_Union) results (among others) in the following snippet:
{{Infobox Europäische Union}}
<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in [[Spezial:Permanenter Link/108232313]] -->
It works fine for the English version of Wikipedia though.
So the page I want to extract the infobox from would be: http://de.wikipedia.org/wiki/Europäische_Union
And this is the code I'm using:
#!/usr/bin/env python3
# Fetch the wikitext of section 0 (where the infobox normally lives) via the MediaWiki API.
import urllib.parse
import urllib.request
import lxml.etree

title = "Europäische_Union"
params = {
    "format": "xml",
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvsection": 0,
    "titles": "API|%s" % title,
}
# urlencode percent-encodes the UTF-8 title, so no manual quoting or encoding hacks are needed
url = "https://de.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
tree = lxml.etree.parse(urllib.request.urlopen(url))
revs = tree.xpath("//rev")
# Print the wikitext of the last revision element returned
print(revs[-1].text)
Am I missing something very substantial?
Upvotes: 1
Views: 961
Reputation: 2544
Such data should not be taken from Wikipedia but from Wikidata, which is Wikipedia's structured data counterpart. (Also, that's not a standard infobox: it has no parameters and is filled in on the template itself.)
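To make the point about parameters concrete, here is a minimal sketch that checks whether an infobox call in the returned wikitext carries any parameters. The function name `find_infoboxes` and the regular expression are illustrative assumptions, not part of any MediaWiki library, and the check only looks at whether the template name is followed by `|` or by `}}`.

```python
import re

# Matches the start of a template call whose name begins with "Infobox" and
# records whether the name is followed by "|" (parameters) or "}}" (none).
INFOBOX_START = re.compile(r"\{\{\s*(Infobox[^|{}]*?)\s*(\||\}\})")
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def find_infoboxes(wikitext):
    """Return (template_name, has_parameters) for each infobox call found."""
    # Drop HTML comments first so commented-out templates are ignored.
    visible = HTML_COMMENT.sub("", wikitext)
    return [
        (match.group(1), match.group(2) == "|")
        for match in INFOBOX_START.finditer(visible)
    ]


snippet = (
    "{{Infobox Europäische Union}}\n"
    "<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in "
    "[[Spezial:Permanenter Link/108232313]] -->"
)
print(find_infoboxes(snippet))
```

For the snippet from the question this reports the infobox as having no parameters, which is why the API response contains nothing to parse there.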
Use the Wikidata API module wbgetclaims to get all the data on the European Union:
https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q458
Much neater, eh? See https://www.wikidata.org/wiki/Wikidata:Data_access for more.
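As a rough Python sketch of the same request: the helper names (`build_claims_url`, `summarize_claims`, `fetch_claims`) and the User-Agent string are made up for illustration, `format=json` is added to get machine-readable output, and the response layout (`claims` → property ID → list of statements with a `mainsnak`) is assumed from how wbgetclaims responses typically look.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_claims_url(entity_id):
    """Build the wbgetclaims URL for one entity, asking for JSON output."""
    params = {"action": "wbgetclaims", "entity": entity_id, "format": "json"}
    return API_URL + "?" + urllib.parse.urlencode(params)


def summarize_claims(response):
    """Map each property ID to the raw values of its statements."""
    summary = {}
    for prop, statements in response.get("claims", {}).items():
        values = []
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # Snaks of type "novalue" or "somevalue" carry no datavalue.
            if snak.get("snaktype") == "value":
                values.append(snak["datavalue"]["value"])
        summary[prop] = values
    return summary


def fetch_claims(entity_id):
    """Download and decode the claims for an entity (needs network access)."""
    request = urllib.request.Request(
        build_claims_url(entity_id),
        # Placeholder User-Agent; replace with one identifying your own tool.
        headers={"User-Agent": "infobox-example/0.1 (hypothetical)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `summarize_claims(fetch_claims("Q458"))` should then give a dictionary keyed by property ID (such as `P31`); the values are raw Wikidata values, so items appear as entity references rather than labels.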
Upvotes: 1