Reputation: 163
The code I usually use to get simple HTML data into a dataframe raises IndexError: list index out of range when I try to read listed-company data from Nasdaq, and I can't see where the issue is.
I would really appreciate some help.
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = 'https://www.nasdaq.com/market-activity/stocks/screener'
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table', rules = 'all')
table = str(tables[0]) #cast table to string
df = pd.read_html(table, skiprows=2, flavor='bs4')[0]
print(df.head())
This is the first HTML table I want to read into a dataframe, out of 406 pages. Each page has 20 rows of company data.
Upvotes: 2
Views: 177
Reputation: 24928
That page is loaded dynamically using JavaScript, so you can't load its data with requests alone: the HTML that requests receives presumably contains no matching table, so find_all returns an empty list and tables[0] raises the IndexError. Using the Network tab of your browser's developer tools, you can find the XHR request that fetches the data, with all its parameters, and reproduce it. The response is a JSON string that the json module can parse into a dictionary, which pandas can then read into a dataframe:
import requests
import json
import pandas as pd
cookies = {
'AKA_A2': 'A',
'ak_bmsc': '26AE62881D0C356A61469B75581CA47C1743FBD2BD1D000043A5905F1822AD4D~plb7aR8lcS0B182W3yyiLmqNhQixYZnITWijGoQj/spL0g901tCb0pELApCkydFuj4bGJCEiELfcYRtc2CCQ/tGJW8qmqP0CKaWoD69VKYrrz6hStjyz5KhVZnU2OmvmzWGxB4pu6azcXssTvZF6u9VBVnrWfgskZ93v5Bzjd2dFi4GNkm0wVoriKx3EF3a42osdrS2BerWewsuue0VxZ6yplfE4lqCpJSfjCY1ZWmfWw=',
'entryUrl': 'https://www.nasdaq.com/market-activity/stocks/screener',
'entryReferringURL': '',
'bm_sv': 'D9BCD75B3FA5D7AEA644C7F771A79D5D~LRB0VO8GVndeCJY9Plht4U91bu3VCHeOYyDjNVBhBBWccsB0Qyp5kETWAi6k5W9yYuGoNpzShkqSAp2XiBV9yxtcojZ4T2srEkO++eu7cqWzDoxyGhY+5p/sdZqs1lgBTQq+U0yEnNugvWYB47+1cSKC4BmfG+38fuiAEeUGHR8=',
'bm_mi': 'F10357CB7F1B64B05A3B062B4BF82D00~4IByO6LJHitJhKvPEZPQz9+rv5diETENZZAFqqe7zVEdPRlAFti/sVtnzwy+6DrelVLk66qeOjSIoq8f9FvJxw8yJQd0yuZ5pEv955P9BBA8JSJtJBefi6A8W2HM2SrhMfdz7GvxajupdlUd2UlNo6vp9rRO4dFDmwOyV0R1yq8UtTDDICzm/2+Dsb2e33u1ToU0Y0gPIIDVRSlahJjKrhYwbr3mTXx+JJoy6612KGlZEuWQnieh31FMu1Fu0NaOUxmK9/8I2kMNKNgJh0nOuA==',
}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0',
'Accept': 'application/json, text/plain, */*',
'Accept-Language': 'en-US,en;q=0.5',
'Connection': 'keep-alive',
'Referer': 'https://www.nasdaq.com/market-activity/stocks/screener',
'Cache-Control': 'max-age=0',
'TE': 'Trailers',
}
params = (
('page', '1'),
('pageSize', '20'),
)
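# Request the first page of screener data from the JSON endpoint
response = requests.get('https://www.nasdaq.com/api/v1/screener', headers=headers, params=params, cookies=cookies)
# Parse the JSON body into a dictionary
table = json.loads(response.text)
df = pd.DataFrame.from_dict(table)
# Each entry under 'data' is a dict describing one company; expand them into columns
new_df = pd.DataFrame(list(df['data']))
print(new_df)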
Output:
   ticker           company      marketCap  marketCapGroup      sectorName  \
0    AAPL             Apple  2037310823400            Mega      Technology
1    MSFT         Microsoft  1624396702497            Mega      Technology
2    AMZN            Amazon  1611367016163            Mega  Consumer Goods
3    GOOG  Alphabet Class C  1058287004605            Mega      Technology
etc.
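Since the question mentions 406 pages of 20 rows each, the same request could be repeated with an increasing page parameter. Below is a minimal sketch of that loop, not a tested scraper: collect_pages, rows_from_payload and fake_fetch are hypothetical names introduced here, and the payload shape (rows under a 'data' key) is taken from the code above. fake_fetch stands in for the real request so the sketch runs offline.

```python
def rows_from_payload(payload):
    """Return the row dicts of one response (assumed to sit under 'data')."""
    return payload.get("data") or []


def collect_pages(fetch_page, max_pages):
    """Call fetch_page(page) for page = 1..max_pages and gather all rows.

    fetch_page is any callable that returns the parsed JSON dict for a page.
    The loop stops early at the first page that has no rows.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = rows_from_payload(fetch_page(page))
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows


# Hypothetical offline stand-in for the real HTTP request.
def fake_fetch(page):
    sample = {
        1: {"data": [{"ticker": "AAPL", "company": "Apple"},
                     {"ticker": "MSFT", "company": "Microsoft"}]},
        2: {"data": [{"ticker": "AMZN", "company": "Amazon"}]},
    }
    return sample.get(page, {"data": []})


all_rows = collect_pages(fake_fetch, max_pages=406)
print(len(all_rows), [row["ticker"] for row in all_rows])
```

To use this against the live site, fetch_page would wrap the requests.get call above with ('page', str(page)) in params and return the parsed JSON, assuming the endpoint still accepts those parameters; pd.DataFrame(all_rows) then gives a single dataframe. A short pause between requests would be a sensible courtesy to the server.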
Upvotes: 3