shakstroworld

Reputation: 1

Scraping from a dropdown menu using beautifulsoup

I am trying to scrape a list of dates from: https://ca.finance.yahoo.com/quote/AAPL/options

The dates are located in a drop-down menu right above the option chain. I've scraped text from this website before, but this text sits inside 'select' and 'option' elements. How would I adjust my code to gather this type of text? I have tried many variations of the code below, but with no luck.

Thank you very much.

    import requests
    from bs4 import BeautifulSoup

    datesLink = ('https://ca.finance.yahoo.com/quote/AAPL/options')
    datesPage = requests.get(datesLink)
    datesSoup = BeautifulSoup(datesPage.text, 'lxml')

    datesQuote = datesSoup.find('div', {'class': 'Cf Pt(18px)controls'}).find('option').text

Upvotes: 0

Views: 205

Answers (2)

Boris Lipschitz

Reputation: 1641

You can't extract this dropdown list because it is generated dynamically, so the 'option' elements are not present in the HTML that requests downloads. The easiest way to confirm this is to save the HTML content to a file and inspect it manually in a text editor.
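As a rough illustration of that check, the sketch below counts the 'option' elements in a piece of HTML using only the standard library (the same idea as `soup.find_all('option')` in BeautifulSoup). The two sample strings and the names `OptionCollector` and `option_texts` are made up for the example; they are not the real Yahoo markup.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the text of every <option> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False

    def handle_data(self, data):
        # Only keep text that appears between <option> and </option>
        if self.in_option:
            self.options[-1] += data


def option_texts(html):
    """Return the stripped text of each <option> found in html."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.options]


# Hypothetical samples: one page with the dropdown in the markup,
# one where a script would build it later in the browser.
static_html = "<select><option>June 18, 2021</option><option>June 25, 2021</option></select>"
dynamic_html = "<div id='app'></div><script>render()</script>"

print(option_texts(static_html))
print(option_texts(dynamic_html))
```

If the list comes back empty for the text returned by `requests.get(...)`, the dropdown is most likely filled in by a script after the page loads.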

You CAN, however, parse those dates out of the script source code embedded in the same HTML file, using an admittedly ugly regex. For example, this seems to work:

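    import re
    from datetime import datetime, timezone

    import requests

    # Fetch the raw page source; the dates live in an embedded script, not in <option> tags
    content = requests.get('https://ca.finance.yahoo.com/quote/AAPL/options').content.decode()
    # Capture the comma-separated Unix timestamps inside "expirationDates": [...]
    match = re.search(r'"OptionContractsStore".*?"expirationDates".*?\[(.*?)\]', content)
    if match is None:
        raise ValueError('expirationDates not found in the page source')
    dates = [datetime.fromtimestamp(int(x), tz=timezone.utc) for x in match.group(1).split(',')]

    for d in dates:
        print(d.strftime('%Y-%m-%d'))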

Obviously, parsing the page this way isn't foolproof and is likely to break sooner rather than later. But the same can be said of web scraping in general.
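Since the page layout may change at any time, it can help to wrap the regex in a small function that can be tested offline and that returns an empty list instead of crashing when the pattern is missing. This is only a sketch: the function name `extract_expiration_dates` and the `sample` string are hypothetical, and the sample merely imitates the two keys the regex looks for, not the real page source.

```python
import re
from datetime import datetime, timezone

# Same pattern as above; DOTALL lets it span line breaks in the page source
PATTERN = re.compile(
    r'"OptionContractsStore".*?"expirationDates".*?\[(.*?)\]', re.DOTALL
)


def extract_expiration_dates(page_source):
    """Return expiration dates as 'YYYY-MM-DD' strings, or [] if not found."""
    match = PATTERN.search(page_source)
    if match is None:
        return []
    # Split the captured list and skip empty pieces (e.g. an empty array)
    raw = [piece.strip() for piece in match.group(1).split(",") if piece.strip()]
    return [
        datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%Y-%m-%d")
        for ts in raw
    ]


# Made-up stand-in for the embedded script data
sample = '{"OptionContractsStore":{"meta":{},"expirationDates":[1623974400,1624579200]}}'

print(extract_expiration_dates(sample))           # ['2021-06-18', '2021-06-25']
print(extract_expiration_dates("<html></html>"))  # []
```

An empty result is then a cheap signal that the site has changed and the pattern needs another look.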

Upvotes: 1

Prayson W. Daniel

Reputation: 15606

You can simply read the HTML directly into pandas:

    import pandas as pd

    URI = 'https://ca.finance.yahoo.com/quote/AAPL/options'

    # read_html returns a list of DataFrames, one per table found on the page
    df = pd.read_html(URI)[0]  # or [1], depending on the table you want

Upvotes: 0
