stk1234
stk1234

Reputation: 1126

Python BS4 Not Retrieving Results

Using the below code, I am able to fetch "soup" without an issue. My goal is to ultimately fetch the title within the soup object, but I'm having trouble figuring out how to do it. In addition to below, I've also tried various iterations of soup['results'], soup.results, soup.get_text().results .. etc and not sure how to get to it. I can, of course, do soup.get_text() ... (some kind of search function for the string "title," but feel like there has to be a built-in method for this.

55)get_title()
     54     ipdb.set_trace()
---> 55     title = soup.html.head.title.string
     56     title = re.sub(r'[^\x00-\x7F]+',' ', title)

ipdb> type(soup)
<class 'bs4.BeautifulSoup'>
ipdb> soup.title
ipdb> print soup.title
None
ipdb> soup
{"status":"OK","copyright":"Copyright (c) 2018 The New York Times Company. All Rights Reserved.","section":"home","last_updated":"2018-01-07T06:19:00-05:00","num_results":42,"results":[{"section":"Briefing","subsection":"",**"title":"Trump, Palestinians, Golden Globes: Your Weekend Briefing"**, ....

Code

from __future__ import division

import regex as re
import string
import urllib2

from bs4 import BeautifulSoup
from cookielib import CookieJar
import ipdb

PARSER_TYPE = 'html.parser'

def get_title(url):
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(url)
    soup = BeautifulSoup(p.read(), PARSER_TYPE) # This loads fine
    ipdb.set_trace()
    title = soup.html.head.title.string # This is sad
    title = re.sub(r'[^\x00-\x7F]+',' ', title)
    return title

Upvotes: 1

Views: 95

Answers (1)

mhawke
mhawke

Reputation: 87134

Take a look at what p.read() returns. You will find that it is not HTML, it is a JSON string. You can't use a HTML parser to successfully parse JSON, however, you can use a JSON parser such as the one provided in the json package.

import json

p = opener.open(url)
response = json.loads(p.read())

Following this response will reference a dictionary. You can then use dictionary access methods to extract a particular piece of data:

title = response['results'][0]['title']

Note here that response['results'] is itself a list so you need to get the first element of that list (at least for the example that you've shown). response['results'][0] then gives a second nested dictionary that contains the data that you want. Look that up with the title key.

Since the results are contained in a list you might need to iterate over that list to process each result:

for result in response['results']:
    print(result['title'])

If some results do not have title keys you can use dict.get() to perform the lookup without raising an exception:

for result in response['results']:
    print(result.get('title'))

Upvotes: 2

Related Questions