Reading HTML for listed company data into a DataFrame

Question

The code I usually use to get simple HTML data into a dataframe raises IndexError: list index out of range when I try to read listed company data from Nasdaq, and I can't see where the issue is.

I would really appreciate some help.

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://www.nasdaq.com/market-activity/stocks/screener'

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

tables = soup.find_all('table', rules = 'all')

table = str(tables[0]) #cast table to string

df = pd.read_html(table, skiprows=2, flavor='bs4')[0]
print(df.head())

This is the first of 406 HTML tables I want to read into a dataframe. Each page has 20 rows of company data.

Jack Fleeting · Accepted Answer

That page is loaded dynamically using JavaScript, so you can't load it with requests: the HTML that requests receives does not contain the table, which would explain why soup.find_all returns an empty list and tables[0] raises the IndexError. Using the developer tools in your browser, you can intercept the XHR request that generates the data, with all its parameters, and reproduce it. The response is a JSON string that can be parsed with json into a dictionary, which pandas can then read into a dataframe:

import requests
import json
import pandas as pd

cookies = {
    'AKA_A2': 'A',
    'ak_bmsc': '26AE62881D0C356A61469B75581CA47C1743FBD2BD1D000043A5905F1822AD4D~plb7aR8lcS0B182W3yyiLmqNhQixYZnITWijGoQj/spL0g901tCb0pELApCkydFuj4bGJCEiELfcYRtc2CCQ/tGJW8qmqP0CKaWoD69VKYrrz6hStjyz5KhVZnU2OmvmzWGxB4pu6azcXssTvZF6u9VBVnrWfgskZ93v5Bzjd2dFi4GNkm0wVoriKx3EF3a42osdrS2BerWewsuue0VxZ6yplfE4lqCpJSfjCY1ZWmfWw=',
    'entryUrl': 'https://www.nasdaq.com/market-activity/stocks/screener',
    'entryReferringURL': '',
    'bm_sv': 'D9BCD75B3FA5D7AEA644C7F771A79D5D~LRB0VO8GVndeCJY9Plht4U91bu3VCHeOYyDjNVBhBBWccsB0Qyp5kETWAi6k5W9yYuGoNpzShkqSAp2XiBV9yxtcojZ4T2srEkO++eu7cqWzDoxyGhY+5p/sdZqs1lgBTQq+U0yEnNugvWYB47+1cSKC4BmfG+38fuiAEeUGHR8=',
    'bm_mi': 'F10357CB7F1B64B05A3B062B4BF82D00~4IByO6LJHitJhKvPEZPQz9+rv5diETENZZAFqqe7zVEdPRlAFti/sVtnzwy+6DrelVLk66qeOjSIoq8f9FvJxw8yJQd0yuZ5pEv955P9BBA8JSJtJBefi6A8W2HM2SrhMfdz7GvxajupdlUd2UlNo6vp9rRO4dFDmwOyV0R1yq8UtTDDICzm/2+Dsb2e33u1ToU0Y0gPIIDVRSlahJjKrhYwbr3mTXx+JJoy6612KGlZEuWQnieh31FMu1Fu0NaOUxmK9/8I2kMNKNgJh0nOuA==',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Referer': 'https://www.nasdaq.com/market-activity/stocks/screener',
    'Cache-Control': 'max-age=0',
    'TE': 'Trailers',
}

params = (
    ('page', '1'),
    ('pageSize', '20'),
)

response = requests.get('https://www.nasdaq.com/api/v1/screener', headers=headers, params=params, cookies=cookies)

table = json.loads(response.text)  # parse the JSON string into a dictionary
df = pd.DataFrame.from_dict(table)
new_df = pd.DataFrame(list(df['data']))  # expand the records under 'data' into columns
print(new_df)

Output:

   ticker               company      marketCap marketCapGroup      sectorName  \
0    AAPL                 Apple  2037310823400           Mega      Technology   
1    MSFT             Microsoft  1624396702497           Mega      Technology   
2    AMZN                Amazon  1611367016163           Mega  Consumer Goods   
3    GOOG      Alphabet Class C  1058287004605           Mega      Technology

etc.
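
The request above only fetches page 1, while the question mentions 406 pages of 20 rows each. A minimal sketch of how the pages might be combined is below. It assumes every response carries its rows under a 'data' key, as the code above suggests. The names collect_pages, rows_from_payload and fake_fetch are hypothetical, and fake_fetch stands in for the real request so the example runs without network access.

```python
import pandas as pd


def rows_from_payload(payload):
    """Return the list of row dictionaries stored under the 'data' key."""
    return payload.get("data", [])


def collect_pages(fetch_page, max_pages):
    """Call fetch_page(1), fetch_page(2), ... and stack the rows into one dataframe.

    fetch_page is any callable that takes a page number and returns the
    parsed JSON dictionary for that page. Stops early on an empty page.
    """
    frames = []
    for page in range(1, max_pages + 1):
        rows = rows_from_payload(fetch_page(page))
        if not rows:
            break  # no more data, stop requesting further pages
        frames.append(pd.DataFrame(rows))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def fake_fetch(page):
    """Hypothetical stand-in for the real request, using made-up sample pages."""
    sample = {
        1: {"data": [{"ticker": "AAPL", "company": "Apple"},
                     {"ticker": "MSFT", "company": "Microsoft"}]},
        2: {"data": [{"ticker": "AMZN", "company": "Amazon"}]},
    }
    return sample.get(page, {"data": []})


all_rows = collect_pages(fake_fetch, max_pages=5)
print(all_rows)
```

With the real endpoint, fetch_page could wrap the requests.get call above, passing the page number in params and returning the parsed JSON. Whether the endpoint accepts those parameters for every page, or still needs the captured cookies, is not something this sketch can confirm.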
