Reputation: 1126
Using the code below, I am able to fetch "soup" without an issue. My ultimate goal is to fetch the title within the soup object, but I'm having trouble figuring out how to do it. In addition to what is shown below, I've also tried various iterations of soup['results'], soup.results, soup.get_text().results, etc., and I'm not sure how to get to it. I can, of course, call soup.get_text() and run some kind of search for the string "title", but I feel like there has to be a built-in method for this.
55)get_title()
54 ipdb.set_trace()
---> 55 title = soup.html.head.title.string
56 title = re.sub(r'[^\x00-\x7F]+',' ', title)
ipdb> type(soup)
<class 'bs4.BeautifulSoup'>
ipdb> soup.title
ipdb> print soup.title
None
ipdb> soup
{"status":"OK","copyright":"Copyright (c) 2018 The New York Times Company. All Rights Reserved.","section":"home","last_updated":"2018-01-07T06:19:00-05:00","num_results":42,"results":[{"section":"Briefing","subsection":"",**"title":"Trump, Palestinians, Golden Globes: Your Weekend Briefing"**, ....
Code
from __future__ import division
import regex as re
import string
import urllib2
from bs4 import BeautifulSoup
from cookielib import CookieJar
import ipdb
PARSER_TYPE = 'html.parser'
def get_title(url):
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(url)
    soup = BeautifulSoup(p.read(), PARSER_TYPE) # This loads fine
    ipdb.set_trace()
    title = soup.html.head.title.string # This is sad
    title = re.sub(r'[^\x00-\x7F]+',' ', title)
    return title
Upvotes: 1
Views: 95
Reputation: 87134
Take a look at what p.read() returns. You will find that it is not HTML; it is a JSON string. You can't use an HTML parser to successfully parse JSON. You can, however, use a JSON parser such as the one provided in the json package.
import json
p = opener.open(url)
response = json.loads(p.read())
After this, response will reference a dictionary. You can then use ordinary dictionary access to extract a particular piece of data:
title = response['results'][0]['title']
Note that response['results'] is itself a list, so you need to get the first element of that list (at least for the example that you've shown). response['results'][0] then gives a nested dictionary that contains the data that you want. Look that up with the 'title' key.
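To make the nesting concrete, here is a minimal sketch of the same lookup, one step at a time. It assumes Python 3 and uses a trimmed, hand-copied sample shaped like the output in the question instead of a live request, so the raw value and the intermediate names (results, first) are only illustrative:

```python
import json

# Trimmed sample shaped like the response shown in the question.
# In the real code these bytes would come from p.read().
raw = (
    b'{"status": "OK", "section": "home", "results": ['
    b'{"section": "Briefing", "subsection": "", '
    b'"title": "Trump, Palestinians, Golden Globes: Your Weekend Briefing"}]}'
)

response = json.loads(raw.decode("utf-8"))  # top level: a dict
results = response["results"]               # a list of dicts
first = results[0]                          # the first result, a dict
title = first["title"]                      # the string we are after

print(title)
```

Each step peels off one layer: dictionary, then list, then dictionary again. Writing response['results'][0]['title'] is the same thing in a single expression.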
Since the results are contained in a list you might need to iterate over that list to process each result:
for result in response['results']:
    print(result['title'])
If some results do not have a 'title' key, you can use dict.get() to perform the lookup without raising an exception; it returns None when the key is missing:
for result in response['results']:
    print(result.get('title'))
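As a rough illustration of how dict.get() behaves when a key is absent, the sketch below uses a small made-up results list (the second entry deliberately has no 'title' key, and the '(untitled)' placeholder is just an example default):

```python
# Made-up data for illustration; the second entry has no "title" key.
results = [
    {"section": "Briefing", "title": "Your Weekend Briefing"},
    {"section": "World"},
]

# dict.get() returns None for a missing key instead of raising KeyError.
titles = [result.get("title") for result in results]

# A second argument supplies a default to use in place of None.
labeled = [result.get("title", "(untitled)") for result in results]

# Alternatively, skip results that have no title at all.
present = [result["title"] for result in results if "title" in result]

print(titles)
print(labeled)
print(present)
```

Which variant fits depends on whether a missing title should show up as a placeholder or be dropped from the output.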
Upvotes: 2