ian-campbell

Reputation: 1665

Wikipedia API JSON with Python

I want to build a Python list of all of Vincent van Gogh's paintings from the JSON returned by a Wikipedia API call. Here is the URL I use to make the request:

http://en.wikipedia.org/w/api.php?format=json&action=query&titles=list%20of%20works%20by%20Vincent%20van%20Gogh&Page&prop=revisions&rvprop=content

As you can see if you open the URL in your browser, the response is a huge blob of text. I have researched this and tried several approaches before asking. I expected the JSON to load into a dictionary that is easy to work with, but I can't make sense of its structure. How would you extract the names of the paintings from this JSON response?

Upvotes: 2

Views: 4732

Answers (2)

stallingOne

Reputation: 4006

Here is a quick way to get your list into a pandas DataFrame:

import pandas as pd
url = 'http://en.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh'
df = pd.read_html(url, attrs={"class": "wikitable"})[0] # 0 is for the 1st table in this particular page
df.head()

Upvotes: 0

alecxe

Reputation: 473763

Instead of parsing the results of the JSON API call directly, use a Python wrapper such as the wikipedia package:

import wikipedia

page = wikipedia.page("List_of_works_by_Vincent_van_Gogh")
print(page.links)  # titles of the pages linked from the list page

There are also other clients and wrappers.

Alternatively, here's an option that scrapes the rendered page with the BeautifulSoup HTML parser:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh"
>>> soup = BeautifulSoup(urlopen(url), "html.parser")
>>> table = soup.find('table', class_="wikitable")
>>> for row in table.find_all('tr')[1:]:
...     print(row.find_all('td')[1].text)
... 
Still Life with Cabbage and Clogs
Crouching Boy with Sickle, Black chalk and watercolor
Woman Sewing, Watercolor
Woman with White Shawl
...
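
If you do want to work with the JSON from the question directly, the sketch below shows one way to reach the page text. It assumes the legacy response layout that a prop=revisions&rvprop=content query like the one in the question returns, where the content sits under query -> pages -> (page ID) -> revisions -> "*". The helper names fetch_json and extract_wikitext, and the sample dictionary with its invented page ID and content, are illustrative only and are not part of any library. What comes back is raw wikitext, so the painting titles still have to be parsed out of the table markup afterwards.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download an API response and parse it into a dict (needs network access)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_wikitext(data):
    """Return the raw wikitext of the first page in a prop=revisions response."""
    pages = data["query"]["pages"]
    # Pages are keyed by page ID, which is not known in advance,
    # so take the first (and here only) value.
    page = next(iter(pages.values()))
    # Assumption: legacy JSON layout, where the revision text is under "*".
    return page["revisions"][0]["*"]


# Hand-made stand-in with the same nesting as the real response.
# The page ID and the wikitext are invented for illustration.
sample = {
    "query": {
        "pages": {
            "12345": {
                "title": "List of works by Vincent van Gogh",
                "revisions": [
                    {"*": "{| class=\"wikitable\"\n|-\n| ''Example title''\n|}"}
                ],
            }
        }
    }
}

wikitext = extract_wikitext(sample)
print(wikitext.splitlines()[0])
```

To use it on the live data, call extract_wikitext(fetch_json(url)) with the URL from the question. If the layout differs from what is assumed here, the lookup raises KeyError instead of returning partial data, and printing data["query"]["pages"].keys() is a quick way to inspect what actually came back.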

Upvotes: 6
