Reputation: 13
So I am following this series on Python for Finance and it keeps giving me errors:

1) line 22, in <module>: save_sp500_tickers()
2) line 8, in save_sp500_tickers: soup = bs.BeautifulSoup(resp.text, 'lxml')
3) line 165, in __init__: % ",".join(features)
   bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

I have been at it for a whole day and I honestly refuse to give up, so any help with this would be greatly appreciated. Also, if anyone has suggestions for something other than pickle and can help me write something that loads the S&P 500 list without pickle, that would be great.
import bs4 as bs
import pickle
import requests
import lxml

def save_sp500_tickers():
    resp = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    print(tickers)
    return tickers

save_sp500_tickers()
Upvotes: 0
Views: 3772
Reputation: 63473
To obtain an official list of S&P 500 symbols as constituents of the SPY ETF, pandas.read_excel can be used. A package such as openpyxl is also required, as it is used internally by pandas.
import pandas as pd

def list_spy_holdings() -> pd.DataFrame:
    # Ref: https://stackoverflow.com/a/75845569/
    # Source: https://www.ssga.com/us/en/intermediary/etfs/funds/spdr-sp-500-etf-trust-spy
    # Note: One of the included holdings is CASH_USD.
    url = 'https://www.ssga.com/us/en/intermediary/etfs/library-content/products/fund-data/etfs/us/holdings-daily-us-en-spy.xlsx'
    return pd.read_excel(url, engine='openpyxl', index_col='Ticker', skiprows=4).dropna()
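The ticker symbols can then be pulled off the index (a minimal usage sketch, assuming the SSGA spreadsheet layout above; remember that CASH_USD is among the rows):

holdings = list_spy_holdings()
spy_symbols = holdings.index.to_list()   # includes the CASH_USD placeholder noted above
print(len(spy_symbols), spy_symbols[:5])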
To obtain an unofficial list of S&P 500 symbols, pandas.read_html can be used. A parser such as lxml, or bs4 together with html5lib, is also required, as it is used internally by pandas.
import pandas as pd

def list_wikipedia_sp500() -> pd.DataFrame:
    # Ref: https://stackoverflow.com/a/75845569/
    url = 'https://en.m.wikipedia.org/wiki/List_of_S%26P_500_companies'
    return pd.read_html(url, attrs={'id': 'constituents'}, index_col='Symbol')[0]
>>> df = list_wikipedia_sp500()
>>> df.head()
                 Security             GICS Sector  ...      CIK      Founded
Symbol                                             ...
MMM                    3M             Industrials  ...    66740         1902
AOS           A. O. Smith             Industrials  ...    91142         1916
ABT                Abbott             Health Care  ...     1800         1888
ABBV               AbbVie             Health Care  ...  1551152  2013 (1888)
ACN             Accenture  Information Technology  ...  1467373         1989

[5 rows x 7 columns]

>>> symbols = df.index.to_list()
>>> symbols[:5]
['MMM', 'AOS', 'ABT', 'ABBV', 'ACN']
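On the question's request to avoid pickle: the symbol list can just as easily be kept in a plain text file (a minimal sketch; the sp500_symbols.txt filename is only illustrative):

from pathlib import Path

symbols = list_wikipedia_sp500().index.to_list()
Path('sp500_symbols.txt').write_text('\n'.join(symbols))      # save without pickle
loaded = Path('sp500_symbols.txt').read_text().splitlines()   # read the list back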
import pandas as pd
import requests

def list_slickcharts_sp500() -> pd.DataFrame:
    # Ref: https://stackoverflow.com/a/75845569/
    url = 'https://www.slickcharts.com/sp500'
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0'  # Default user-agent fails.
    response = requests.get(url, headers={'User-Agent': user_agent})
    return pd.read_html(response.text, match='Symbol', index_col='Symbol')[0]
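Usage mirrors the Wikipedia helper (a quick sketch, assuming Slickcharts keeps a Symbol column as above):

df = list_slickcharts_sp500()
print(df.index.to_list()[:5])   # first few symbols from the table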
These were tested with Pandas 1.5.3.
The results can be cached for a certain period of time, e.g. 12 hours, in memory and/or on disk, so as to avoid the risk of excessive repeated calls to the source.
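For instance, a simple in-memory cache could wrap one of the helpers above (a minimal sketch; the 12-hour window and the module-level _cache dict are only illustrative):

import time

_cache = {}

def cached_sp500_symbols(max_age_s=12 * 3600):
    # Return the cached symbol list if it is still fresh, otherwise refetch it.
    now = time.time()
    if not _cache or now - _cache['time'] > max_age_s:
        _cache['symbols'] = list_wikipedia_sp500().index.to_list()
        _cache['time'] = now
    return _cache['symbols']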
A similar answer for the Nasdaq 100 is here.
Upvotes: 0
Reputation: 56674
Running your code as-is works on my system. Probably, as Eric suggests, you should install lxml.
Unfortunately, if you are on Windows, pip install lxml does not work unless you have a whole compiler infrastructure set up, which you probably don't.
Luckily you can get a precompiled binary installer from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml - make sure you pick the one that matches your version of python and whether it is 32 or 64 bit.
Edit: just for interest, try changing the BeautifulSoup line so it uses Python's built-in parser instead:
soup = bs.BeautifulSoup(resp.text, 'html.parser')  # built-in parser, no lxml needed
See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for a list of available parsers.
Upvotes: 2