equanimity
equanimity

Reputation: 2533

How to strip columns from dataframe using pd.read_html and return output as a list

I'm trying to get a list of stock symbols using the Pandas read_html function (instead of using Beautiful Soup to scrape the web).

The website I'm referencing is:

https://en.wikipedia.org/wiki/List_of_S%26P_500_companies

The desired output is:

['MMM', 'ABT', 'ABBV', 'ACN', 'ATVI' ... ] 

My code is:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
df = pd.read_html(url)[0]
#df.columns = df.iloc[0]
df.drop(df.index[0], inplace=True)
tickers = df['Symbol'].tolist()

The output of this code is a dataframe that looks as follows:

df.head()

    Symbol  Security    SEC filings GICS Sector GICS Sub Industry   Headquarters Location   Date first added    CIK Founded
1   ABT Abbott Laboratories reports Health Care Health Care Equipment   North Chicago, Illinois 1964-03-31  1800    1888
2   ABBV    AbbVie Inc. reports Health Care Pharmaceuticals North Chicago, Illinois 2012-12-31  1551152 2013 (1888)
3   ABMD    ABIOMED Inc reports Health Care Health Care Equipment   Danvers, Massachusetts  2018-05-31  815094  1981
4   ACN Accenture plc   reports Information Technology  IT Consulting & Other Services  Dublin, Ireland 2011-07-06  1467373 1989
5   ATVI    Activision Blizzard reports Communication Services  Interactive Home Entertainment  Santa Monica, California    2015-08-31  718877  2008

If I uncomment df.columns = df.iloc[0], then Pandas throws the following error message

KeyError: 'Symbol'

The line df.iloc[0] returns:

Symbol                                       ABT
Security                     Abbott Laboratories
SEC filings                              reports
GICS Sector                          Health Care
GICS Sub Industry          Health Care Equipment
Headquarters Location    North Chicago, Illinois
Date first added                      1964-03-31
CIK                                         1800
Founded                                     1888

Which is not what I'm looking for (rather, the header row before this one that contains the 'Symbol' column).

Does anyone see what I'm doing incorrectly here? Thanks!

Upvotes: 1

Views: 902

Answers (2)

bharatk
bharatk

Reputation: 4315

Using pandas library to read html table data. tolist() is used to convert a series to list.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
df = pd.read_html(url)[0]
# filter table data by selected columns
df = df[['Symbol']]
tickers = df['Symbol'].tolist()

print(tickers)

O/P:

['MMM', 'ABT', 'ABBV', 'ABMD', 'ACN', 'ATVI', 'ADBE', 'AMD', 'AAP', 'AES', 'AMG', 'AFL', 'A', 'APD', 'AKAM', 'ALK', 'ALB', 'ARE', 'ALXN', 'ALGN', 'ALLE', 'AGN', 'ADS', 'LNT', 'ALL', 'GOOGL', 'GOOG', 'MO', 'AMZN', 'AMCR', 'AEE', 'AAL', 'AEP', 'AXP', 'AIG', 'AMT', 'AWK', 'AMP', 'ABC', 'AME', 'AMGN', 'APH', 'APC', 'ADI', 'ANSS', 'ANTM', 'AON', 'AOS', 'APA', 'AIV', 'AAPL', 'AMAT', 'APTV', 'ADM', 'ARNC', 'ANET', 'AJG', 'AIZ', 'ATO', 'T', 'ADSK', 'ADP', 'AZO', 'AVB', 'AVY', 'BHGE', 'BLL', 'BAC', 'BK', 'BAX', 'BBT', 'BDX', 'BRK.B', 'BBY', 'BIIB', 'BLK', 'HRB', 'BA', 'BKNG', 'BWA', 'BXP', 'BSX', 'BMY', 'AVGO', 'BR', 'BF.B', 'CHRW', 'COG', 'CDNS', 'CPB', 'COF', 'CPRI', 'CAH', 'KMX', 'CCL', 'CAT', 'CBOE', 'CBRE', 'CBS', 'CE', 'CELG', 'CNC', 'CNP', 'CTL', 'CERN', 'CF', 'SCHW', 'CHTR', 'CVX', 'CMG', 'CB', 'CHD', 'CI', 'XEC', 'CINF', 'CTAS', 'CSCO', 'C', 'CFG', 'CTXS', 'CLX', 'CME', 'CMS', 'KO', 'CTSH', 'CL', 'CMCSA', 'CMA', 'CAG', 'CXO', 'COP', 'ED', 'STZ', 'COO', 'CPRT', 'GLW', 'CTVA', 'COST', 'COTY', 'CCI', 'CSX', 'CMI', 'CVS', 'DHI', 'DHR', 'DRI', 'DVA', 'DE', 'DAL', 'XRAY', 'DVN', 'FANG', 'DLR', 'DFS', 'DISCA', 'DISCK', 'DISH', 'DG', 'DLTR', 'D', 'DOV', 'DOW', 'DTE', 'DUK', 'DRE', 'DD', 'DXC', 'ETFC', 'EMN', 'ETN', 'EBAY', 'ECL', 'EIX', 'EW', 'EA', 'EMR', 'ETR', 'EOG', 'EFX', 'EQIX', 'EQR', 'ESS', 'EL', 'EVRG', 'ES', 'RE', 'EXC', 'EXPE', 'EXPD', 'EXR', 'XOM', 'FFIV', 'FB', 'FAST', 'FRT', 'FDX', 'FIS', 'FITB', 'FE', 'FRC', 'FISV', 'FLT', 'FLIR', 'FLS', 'FMC', 'FL', 'F', 'FTNT', 'FTV', 'FBHS', 'FOXA', 'FOX', 'BEN', 'FCX', 'GPS', 'GRMN', 'IT', 'GD', 'GE', 'GIS', 'GM', 'GPC', 'GILD', 'GPN', 'GS', 'GWW', 'HAL', 'HBI', 'HOG', 'HIG', 'HAS', 'HCA', 'HCP', 'HP', 'HSIC', 'HSY', 'HES', 'HPE', 'HLT', 'HFC', 'HOLX', 'HD', 'HON', 'HRL', 'HST', 'HPQ', 'HUM', 'HBAN', 'HII', 'IDXX', 'INFO', 'ITW', 'ILMN', 'IR', 'INTC', 'ICE', 'IBM', 'INCY', 'IP', 'IPG', 'IFF', 'INTU', 'ISRG', 'IVZ', 'IPGP', 'IQV', 'IRM', 'JKHY', 'JEC', 'JBHT', 'JEF', 'SJM', 'JNJ', 'JCI', 'JPM', 'JNPR', 'KSU', 'K', 'KEY', 'KEYS', 'KMB', 'KIM', 'KMI', 'KLAC', 'KSS', 'KHC', 'KR', 'LB', 'LHX', 'LH', 'LRCX', 'LW', 'LEG', 'LEN', 'LLY', 'LNC', 'LIN', 'LKQ', 'LMT', 'L', 'LOW', 'LYB', 'MTB', 'MAC', 'M', 'MRO', 'MPC', 'MKTX', 'MAR', 'MMC', 'MLM', 'MAS', 'MA', 'MKC', 'MXIM', 'MCD', 'MCK', 'MDT', 'MRK', 'MET', 'MTD', 'MGM', 'MCHP', 'MU', 'MSFT', 'MAA', 'MHK', 'TAP', 'MDLZ', 'MNST', 'MCO', 'MS', 'MOS', 'MSI', 'MSCI', 'MYL', 'NDAQ', 'NOV', 'NKTR', 'NTAP', 'NFLX', 'NWL', 'NEM', 'NWSA', 'NWS', 'NEE', 'NLSN', 'NKE', 'NI', 'NBL', 'JWN', 'NSC', 'NTRS', 'NOC', 'NCLH', 'NRG', 'NUE', 'NVDA', 'ORLY', 'OXY', 'OMC', 'OKE', 'ORCL', 'PCAR', 'PKG', 'PH', 'PAYX', 'PYPL', 'PNR', 'PBCT', 'PEP', 'PKI', 'PRGO', 'PFE', 'PM', 'PSX', 'PNW', 'PXD', 'PNC', 'PPG', 'PPL', 'PFG', 'PG', 'PGR', 'PLD', 'PRU', 'PEG', 'PSA', 'PHM', 'PVH', 'QRVO', 'PWR', 'QCOM', 'DGX', 'RL', 'RJF', 'RTN', 'O', 'RHT', 'REG', 'REGN', 'RF', 'RSG', 'RMD', 'RHI', 'ROK', 'ROL', 'ROP', 'ROST', 'RCL', 'CRM', 'SBAC', 'SLB', 'STX', 'SEE', 'SRE', 'SHW', 'SPG', 'SWKS', 'SLG', 'SNA', 'SO', 'LUV', 'SPGI', 'SWK', 'SBUX', 'STT', 'SYK', 'STI', 'SIVB', 'SYMC', 'SYF', 'SNPS', 'SYY', 'TROW', 'TTWO', 'TPR', 'TGT', 'TEL', 'FTI', 'TFX', 'TXN', 'TXT', 'TMO', 'TIF', 'TWTR', 'TJX', 'TMK', 'TSS', 'TSCO', 'TDG', 'TRV', 'TRIP', 'TSN', 'UDR', 'ULTA', 'USB', 'UAA', 'UA', 'UNP', 'UAL', 'UNH', 'UPS', 'URI', 'UTX', 'UHS', 'UNM', 'VFC', 'VLO', 'VAR', 'VTR', 'VRSN', 'VRSK', 'VZ', 'VRTX', 'VIAB', 'V', 'VNO', 'VMC', 'WAB', 'WMT', 'WBA', 'DIS', 'WM', 'WAT', 'WEC', 'WCG', 'WFC', 'WELL', 'WDC', 'WU', 'WRK', 'WY', 'WHR', 'WMB', 'WLTW', 'WYNN', 'XEL', 'XRX', 'XLNX', 'XYL', 'YUM', 'ZBH', 'ZION', 'ZTS']

Upvotes: 2

jezrael
jezrael

Reputation: 862601

Problem is if use:

df.columns = df.iloc[0]

...then rewrite columns of DataFrame by first data row, so not exist original Symbol column and error raise:

url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
df = pd.read_html(url)[0]
print (df.head(3))
  Symbol             Security SEC filings  GICS Sector  \
0    MMM           3M Company     reports  Industrials   
1    ABT  Abbott Laboratories     reports  Health Care   
2   ABBV          AbbVie Inc.     reports  Health Care   

          GICS Sub Industry    Headquarters Location Date first added  \
0  Industrial Conglomerates      St. Paul, Minnesota              NaN   
1     Health Care Equipment  North Chicago, Illinois       1964-03-31   
2           Pharmaceuticals  North Chicago, Illinois       2012-12-31   

       CIK      Founded  
0    66740         1902  
1     1800         1888  
2  1551152  2013 (1888) 

print (df.columns)
Index(['Symbol', 'Security', 'SEC filings', 'GICS Sector', 'GICS Sub Industry',
       'Headquarters Location', 'Date first added', 'CIK', 'Founded'],
      dtype='object')

df.columns = df.iloc[0]
print (df.head(3))
0   MMM           3M Company  reports  Industrials  Industrial Conglomerates  \
0   MMM           3M Company  reports  Industrials  Industrial Conglomerates   
1   ABT  Abbott Laboratories  reports  Health Care     Health Care Equipment   
2  ABBV          AbbVie Inc.  reports  Health Care           Pharmaceuticals   

0      St. Paul, Minnesota         NaN    66740         1902  
0      St. Paul, Minnesota         NaN    66740         1902  
1  North Chicago, Illinois  1964-03-31     1800         1888  
2  North Chicago, Illinois  2012-12-31  1551152  2013 (1888) 

print (df.columns)
Index([                     'MMM',               '3M Company',
                        'reports',              'Industrials',
       'Industrial Conglomerates',      'St. Paul, Minnesota',
                              nan,                      66740,
                           '1902'],
      dtype='object', name=0)

Upvotes: 1

Related Questions