Faiz

Reputation: 153

Extracting JSON when web scraping

I was following a Python guide on web scraping, and one line of code won't work for me. I'd appreciate it if anybody could help me figure out what the issue is. Thanks.

from bs4 import BeautifulSoup
import json
import re
import requests

url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
script = soup.find('script', text=re.compile('root\.App\.main'))

json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)

Error Message:

    json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
AttributeError: 'NoneType' object has no attribute 'string'

Link to the guide I was looking at: https://www.mattbutton.com/how-to-scrape-stock-upgrades-and-downgrades-from-yahoo-finance/

Upvotes: 0

Views: 566

Answers (1)

HedgeHog

Reputation: 25196

In my opinion, the main issue is that the request is sent without a user-agent header, so you may not get the expected HTML back. In that case soup.find() returns None, which is what raises the AttributeError. Add a user-agent to your request:

headers = {'user-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

Note: First of all, take a closer look at your soup to check whether the expected information is actually there.

Example

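import json
import re

import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
# Send a browser-like user-agent so the expected HTML is returned
headers = {'user-agent': 'Mozilla/5.0'}

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')

# Locate the <script> tag whose text contains the root.App.main assignment
script = soup.find('script', string=re.compile(r'root\.App\.main'))

# Capture the object literal assigned to root.App.main and parse it as JSON
json_text = json.loads(re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$', script.string, flags=re.MULTILINE).group(1))
json_text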
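
If you want the script to fail more gracefully when the marker is missing, one option is to check the match before calling .group(). The following is only a minimal sketch using the standard library: extract_app_main and sample_html are hypothetical names I made up for illustration, and the sample markup is a stand-in, not real Yahoo output.

```python
import json
import re

# Same pattern as above: capture the object literal assigned to root.App.main.
APP_MAIN_RE = re.compile(
    r'^\s*root\.App\.main\s*=\s*(\{.*?\})\s*;\s*$',
    flags=re.MULTILINE,
)


def extract_app_main(html):
    """Return the parsed root.App.main object, or None if it is not in html."""
    match = APP_MAIN_RE.search(html)
    if match is None:
        # The marker is missing, e.g. because the server returned a
        # different page than the one the pattern was written for.
        return None
    return json.loads(match.group(1))


# Hypothetical sample markup standing in for a real response body.
sample_html = """<html><body><script>
(function (root) {
root.App.main = {"context": {"ticker": "AAPL", "nested": {"a": 1}}};
}(this));
</script></body></html>"""

data = extract_app_main(sample_html)
print(data["context"]["ticker"])          # AAPL
print(extract_app_main("<html></html>"))  # None
```

With a live response you would pass page.text to the function and treat a None result as a sign that the HTML you received is not what you expected. Whether the page still embeds root.App.main is something you have to verify yourself.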

Upvotes: 1
