Reputation: 47
I am trying to make a request to the Washington Post search API and extract all articles matching my search query.
import requests
import json
import pandas as pd

#---------Define parameters for API access
params = {
    "count": "100",
    "datefilter": "displaydatetime:[NOW/DAY-1YEAR TO NOW/DAY+1DAY]",
    "facets.fields": "{!ex=include}contenttype,{!ex=include}name",
    "highlight.fields": "headline,body",
    "highlight.on": "true",
    "highlight.snippets": "1",
    "query": "coronavirus",
    "sort": "displaydatetime desc",
    "startat": "0",
    "callback": "angular.callbacks._0"}

#----------Define function
def WP_Scraper(url):
    #-------------Define empty lists to be filled while scraping
    WP_title = []
    WP_date = []
    WP_article = []
    WP_link = []
    with requests.Session() as req:
        for item in range(0, 9527, 100):
            print(f"Extracting Article# {item + 1}")
            params["startat"] = item
            r = req.get(url, params=params).json()
            for loop in r['results']:
                WP_title.append(loop['headline'])
                WP_date.append(loop['pubdatetime'])
                WP_link.append(loop['contenturl'])
                WP_article.append(loop['blurb'])
    #-------------Save in DF
    df = pd.DataFrame()
    df['title'] = WP_title
    df['date'] = WP_date
    df['article'] = WP_article
    df['link'] = WP_link
    return df

WP_data = WP_Scraper("https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json")
I get the following error when calling the function:
Does anyone know what is causing the error or if there is a more efficient method?
I searched Stack Overflow for an answer. If this is a duplicate, please point me in the right direction. Thanks in advance.
Upvotes: 0
Views: 91
Reputation: 819
Looking at the result, the JSON is wrapped in /**/angular.callbacks._0(); (a JSONP callback wrapper). You should strip this before parsing the response as JSON, so you could do something like
r = json.loads(req.get(url, params=params).content.decode('utf-8').strip('/**/angular.callbacks._0();'))
in your request loop. Also, your nested loop is a bit off: from what I understand of the JSON structure, the articles are contained in the documents key under results, and blurb is only present sometimes, so try this
for loop in r['results']['documents']:
    WP_title.append(loop['headline'])
    WP_date.append(loop['pubdatetime'])
    WP_link.append(loop['contenturl'])
    try:
        WP_article.append(loop['blurb'])
    except KeyError:
        # Append a placeholder so all lists keep the same length for the DataFrame
        WP_article.append(None)
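One caveat on the strip call: str.strip with a string argument removes any of those characters from both ends, not an exact prefix and suffix. It happens to work here because the payload starts with { and ends with }, which are not among those characters. A slightly more defensive version could look like the sketch below. The helper names unwrap_jsonp and extract_rows are made up for illustration, and the sample string is a hand-written stand-in for the response shape described above, not real API output.

```python
import json
import re

# Optional leading /**/ comment, a callback name, the parenthesised payload,
# and an optional trailing semicolon.
_JSONP_RE = re.compile(
    r'^\s*(?:/\*.*?\*/)?\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL
)


def unwrap_jsonp(text):
    """Parse a JSONP response body (hypothetical helper, not part of requests)."""
    match = _JSONP_RE.match(text)
    # Fall back to treating the body as plain JSON if there is no callback wrapper.
    payload = match.group(1) if match else text
    return json.loads(payload)


def extract_rows(data):
    """Collect one dict per article; fields that are absent become None."""
    return [
        {
            "title": doc.get("headline"),
            "date": doc.get("pubdatetime"),
            "article": doc.get("blurb"),
            "link": doc.get("contenturl"),
        }
        for doc in data["results"]["documents"]
    ]


# Hand-written sample in the assumed shape: results -> documents -> article fields.
sample = (
    '/**/angular.callbacks._0({"results": {"documents": ['
    '{"headline": "A", "pubdatetime": 1, "contenturl": "http://example.com/a", "blurb": "text"},'
    '{"headline": "B", "pubdatetime": 2, "contenturl": "http://example.com/b"}'
    ']}});'
)
rows = extract_rows(unwrap_jsonp(sample))
```

Inside the request loop you would call unwrap_jsonp(req.get(url, params=params).text), extend one list with extract_rows(...) for each page, and build the frame once at the end with pd.DataFrame(all_rows). Collecting dicts instead of four parallel lists means a missing blurb cannot leave the columns with different lengths. It may also be that leaving out the callback parameter makes the endpoint return plain JSON, but I have not checked that.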
Upvotes: 1