JBWhitmore

Reputation: 12296

Parse birth and death dates from Wikipedia?

I'm trying to write a Python program that can search Wikipedia for the birth and death dates of people.

For example, Albert Einstein was born: 14 March 1879; died: 18 April 1955.

I started with Fetch a Wikipedia article with Python

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
page2 = infile.read()

This works as far as it goes. page2 is the XML representation of the lead section of Albert Einstein's Wikipedia page.

And I looked at this tutorial, now that I have the page in XML format... http://www.travisglines.com/web-coding/python-xml-parser-tutorial, but I don't understand how to get the information I want (birth and death dates) out of the XML. I feel like I must be close, and yet I have no idea how to proceed from here.

EDIT

After a few responses, I've installed BeautifulSoup. I'm now at the stage where I can print:

import BeautifulSoup as BS
soup = BS.BeautifulSoup(page2)
print soup.getText()
{{Infobox scientist
| name        = Albert Einstein
| image       = Einstein 1921 portrait2.jpg
| caption     = Albert Einstein in 1921
| birth_date  = {{Birth date|df=yes|1879|3|14}}
| birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
| death_date  = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
| death_place = [[Princeton, New Jersey|Princeton]], New Jersey, United States
| spouse      = [[Mileva Marić]] (1903–1919)<br>{{nowrap|[[Elsa Löwenthal]] (1919–1936)}}
| residence   = Germany, Italy, Switzerland, Austria, Belgium, United Kingdom, United States
| citizenship = {{Plainlist|
* [[Kingdom of Württemberg|Württemberg/Germany]] (1879–1896)
* [[Statelessness|Stateless]] (1896–1901)
* [[Switzerland]] (1901–1955)
* [[Austria–Hungary|Austria]] (1911–1912)
* [[German Empire|Germany]] (1914–1933)
* United States (1940–1955)
}}

So, much closer, but I still don't know how to extract the death_date from this format. Unless I start parsing things with re? I can do that, but I feel like I'd be using the wrong tool for this job.

Upvotes: 10

Views: 8605

Answers (6)

Jason Sundram

Reputation: 12564

I came across this question and appreciated all the useful information that was provided in @Yoshiki's answer, but it took some synthesizing to get to a working solution. Sharing here in case it's useful for anyone else. The code is also in this gist for those who wish to fork / improve it.

In particular, there's not much in the way of error handling here ...

import csv
from datetime import datetime
import json
import requests
from dateutil import parser


def id_for_page(page):
    """Uses the wikipedia api to find the wikidata id for a page"""
    api = "https://en.wikipedia.org/w/api.php"
    query = "?action=query&prop=pageprops&titles=%s&format=json"
    slug = page.split('/')[-1]

    response = json.loads(requests.get(api + query % slug).content)
    # Assume we got 1 page result and it is correct.
    page_info = list(response['query']['pages'].values())[0]
    return page_info['pageprops']['wikibase_item']


def lifespan_for_id(wikidata_id):
    """Uses the wikidata API to retrieve wikidata for the given id."""
    data_url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json"
    page = json.loads(requests.get(data_url % wikidata_id).content)

    claims = list(page['entities'].values())[0]['claims']
    # P569 (birth) and P570 (death) ... not everyone has died yet.
    return [get_claim_as_time(claims, cid) for cid in ['P569', 'P570']]


def get_claim_as_time(claims, claim_id):
    """Helper function to work with data returned from wikidata api"""
    try:
        claim = claims[claim_id][0]['mainsnak']['datavalue']
        assert claim['type'] == 'time', "Expecting time data type"

        # dateparser chokes on leading '+', thanks wikidata.
        return parser.parse(claim['value']['time'][1:])
    except KeyError as e:
        print(e)
        return None


def main():
    page = 'https://en.wikipedia.org/wiki/Albert_Einstein'

    # 1. use the wikipedia api to find the wikidata id for this page
    wikidata_id = id_for_page(page)

    # 2. use the wikidata id to get the birth and death dates
    span = lifespan_for_id(wikidata_id)

    for label, dt in zip(["birth", "death"], span):
        print(label, " = ", datetime.strftime(dt, "%b %d, %Y"))


if __name__ == '__main__':
    main()

Upvotes: 2

Yoshiki

Reputation: 995

One alternative in 2019 is to use the Wikidata API, which, among other things, exposes biographical data like birth and death dates in a structured format that is very easy to consume without any custom parsers. Many Wikipedia articles depend on Wikidata for their info, so in many cases this will be the same as if you were consuming Wikipedia data.

For example, look at the Wikidata page for Albert Einstein and search for "date of birth" and "date of death"; you will find they are the same as in Wikipedia. Every entity in Wikidata has a list of "claims", which are pairs of "properties" and "values". To know when Einstein was born and died, we only need to search the list of statements for the appropriate properties, in this case P569 and P570. To do this programmatically, it's best to access the entity as JSON, which you can do with the following URL structure:

https://www.wikidata.org/wiki/Special:EntityData/Q937.json

And as an example, here is what the claim P569 states about Einstein:

        "P569": [
          {
            "mainsnak": {
              "property": "P569",
              "datavalue": {
                "value": {
                  "time": "+1879-03-14T00:00:00Z",
                  "timezone": 0,
                  "before": 0,
                  "after": 0,
                  "precision": 11,
                  "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
                },
                "type": "time"
              },
              "datatype": "time"
            },
            "type": "statement",

You can learn more about accessing Wikidata in this article, and more specifically about how dates are structured in Help:Dates.

Upvotes: 1

Jobjörn Folkesson

Reputation: 569

The persondata template is deprecated now, and you should instead access Wikidata. See Wikidata:Data access. My earlier (now deprecated) answer from 2012 was as follows:

What you should do is parse the {{persondata}} template found in most biographical articles. There are existing tools for easily extracting such data programmatically; with your existing knowledge and the other helpful answers, I am sure you can make that work.

Upvotes: 1

Tgr

Reputation: 28220

First, use pywikipedia. It allows you to query article text, template parameters etc. through a high-level abstract interface. Second, I would go with the Persondata template (look towards the end of the article). Also, in the long term, you might be interested in Wikidata, which will take several months to introduce, but it will make most metadata in Wikipedia articles easily queryable.

Upvotes: 5

K Z

Reputation: 30483

You can consider using a library such as BeautifulSoup or lxml to parse the response html/xml.

You may also want to take a look at Requests, which has a much cleaner API for making requests.


Here is working code using Requests, BeautifulSoup and re. It is arguably not the best solution here, but it is quite flexible and can be extended for similar problems:

import re
import requests
from bs4 import BeautifulSoup

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml'

res = requests.get(url)
soup = BeautifulSoup(res.text, "xml")

birth_re = re.search(r'(Birth date(.*?)}})', soup.revisions.getText())
# strip the trailing '}}' from the match before splitting on '|'
birth_data = birth_re.group(0).rstrip('}').split('|')
birth_year = birth_data[2]
birth_month = birth_data[3]
birth_day = birth_data[4]

death_re = re.search(r'(Death date(.*?)}})', soup.revisions.getText())
death_data = death_re.group(0).rstrip('}').split('|')
death_year = death_data[2]
death_month = death_data[3]
death_day = death_data[4]

Per @JBernardo's suggestion, here is a better answer for this particular use case, using JSON data and mwparserfromhell:

import requests
import mwparserfromhell

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json'

res = requests.get(url)
text = list(res.json()["query"]["pages"].values())[0]["revisions"][0]["*"]
wiki = mwparserfromhell.parse(text)

birth_data = wiki.filter_templates(matches="Birth date")[0]
birth_year = birth_data.get(1).value
birth_month = birth_data.get(2).value
birth_day = birth_data.get(3).value

death_data = wiki.filter_templates(matches="Death date")[0]
death_year = death_data.get(1).value
death_month = death_data.get(2).value
death_day = death_data.get(3).value

Upvotes: 8

JBernardo

Reputation: 33407

First: the Wikipedia API allows the use of JSON instead of XML, and that will make things much easier.

Second: there's no need to use HTML/XML parsers at all (the content is not HTML, and the container doesn't need to be XML either). What you need to parse is the Wiki markup inside the "revisions" tag of the JSON response.

Check some Wiki parsers here


What seems to be confusing here is that the API allows you to request a certain container format (XML or JSON), but that is just a wrapper around some text in the real format you want to parse:

This one: {{Birth date|df=yes|1879|3|14}}

With one of the parsers provided in the link above, you will be able to do that.

Upvotes: 6
