flw

Reputation: 47

JSONDecodeError when making an API request

I am trying to query the Washington Post search API and extract all articles matching my search query.

import requests
import json
import pandas as pd

#---------Define Parameters for API access
params = {
    "count": "100",
    "datefilter":"displaydatetime:[NOW/DAY-1YEAR TO NOW/DAY+1DAY]",
    "facets.fields":"{!ex=include}contenttype,{!ex=include}name",
    "highlight.fields":"headline,body",
    "highlight.on":"true",
    "highlight.snippets":"1",
    "query":"coronavirus",
    "sort":"displaydatetime desc",
    "startat": "0",
    "callback":"angular.callbacks._0"}

#----------Define function
def WP_Scraper(url):
    #-------------Define empty lists to hold the scraped fields
    WP_title   = []
    WP_date   = []
    WP_article   = []
    WP_link = []
    
    with requests.Session() as req:
        for item in range(0, 9527, 100):
            print(f"Extracting Article# {item +1}")
            params["startat"] = item
            r = req.get(url, params=params).json()
            for loop in r['results']:
                WP_title.append(loop['headline'])
                WP_date.append(loop['pubdatetime'])
                WP_link.append(loop['contenturl'])
                WP_article.append(loop['blurb'])
                
    #-------------Save in DataFrame
    df = pd.DataFrame()
    df['title'] = WP_title
    df['date'] = WP_date      
    df['article'] = WP_article 
    df['link']=WP_link
    return df  

WP_data = WP_Scraper("https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json")

Calling the function raises a JSONDecodeError.

Does anyone know what is causing the error or if there is a more efficient method?

I searched Stack Overflow for an answer. If this is a duplicate, please point me in the right direction. Thanks in advance.

Upvotes: 0

Views: 91

Answers (1)

Badgy

Reputation: 819

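Looking at the response, the JSON is wrapped in a JSONP callback, /**/angular.callbacks._0(...);, because the request sets the callback parameter. That wrapper is not valid JSON, which is why .json() fails. You should strip it before parsing, so you could do something like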

r = json.loads(req.get(url, params=params).content.decode('utf-8').strip('/**/angular.callbacks._0();'))

in your request loop. Also, your inner loop does not match the JSON structure as I understand it: the articles are under the documents key inside results, and blurb is only present sometimes, so try this:

for loop in r['results']['documents']:
    WP_title.append(loop['headline'])
    WP_date.append(loop['pubdatetime'])
    WP_link.append(loop['contenturl'])
    # Append None when blurb is missing so all four lists stay the same length
    WP_article.append(loop.get('blurb'))
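
If you want something less fragile than str.strip (which removes a set of characters from both ends, not a literal prefix), here is a minimal sketch of the same idea. The helper names unwrap_jsonp and extract_rows are made up for illustration, and the response shape (results -> documents) is assumed from what I saw above, so adjust it if the API returns something different. Collecting one dict per article keeps the columns aligned even when blurb is absent.

```python
import json
import re

# Optional /**/ comment, a callback name, then the parenthesised payload.
_JSONP_RE = re.compile(
    r'^\s*(?:/\*.*?\*/)?\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', re.DOTALL
)

def unwrap_jsonp(text):
    """Parse a JSONP response body (hypothetical helper, not part of requests)."""
    match = _JSONP_RE.match(text)
    if match is None:
        # No callback wrapper found: assume the body is already plain JSON.
        return json.loads(text)
    return json.loads(match.group(1))

def extract_rows(payload):
    """Return one dict per article; the key layout is an assumption."""
    documents = payload.get('results', {}).get('documents', [])
    return [
        {
            'title': doc.get('headline'),
            'date': doc.get('pubdatetime'),
            'link': doc.get('contenturl'),
            'article': doc.get('blurb'),  # None when there is no blurb
        }
        for doc in documents
    ]

# Tiny made-up sample in the same wrapper format, for illustration only.
sample = (
    '/**/angular.callbacks._0({"results": {"documents": '
    '[{"headline": "A", "pubdatetime": 1, "contenturl": "u"}]}});'
)
rows = extract_rows(unwrap_jsonp(sample))
```

In the request loop this would be something like rows.extend(extract_rows(unwrap_jsonp(req.get(url, params=params).text))), followed by pd.DataFrame(rows) once the loop finishes.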

Upvotes: 1
