Reputation: 1
I'm trying to get all of the data from a table on basketball-reference (http://www.basketball-reference.com/leagues/NBA_2015_per_poss.html). When I use XPath to get the data, it comes back as one long list. I have a "chunks" method that splits the list into multiple lists, but because there are empty cells in the table, the chunks drift out of alignment and the list is divided incorrectly. Is there any way to deal with this?
Upvotes: 0
Views: 39
Reputation: 81684
My suggestion: use pandas.DataFrame. It can load data from many sources, including HTML.
You can easily handle empty cells with the fillna method.
Consider this example:
import pandas as pd
# read_html returns a list of DataFrames.
# In this case we know there is only one matching table in the page
df = pd.read_html('http://www.basketball-reference.com/leagues/NBA_2015_per_poss.html',
                  attrs={'id': 'per_poss'})[0]
# the header row repeats every 20 lines, so filter those rows out
df = df[df['Rk'] != 'Rk']
# fill empty cells with 0
# could also use the inplace=True kwarg instead of reassigning, or pass a
# dictionary to use a different value for each column
df = df.fillna(0)
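To illustrate the dictionary form of fillna mentioned in the comments, here is a minimal sketch on a small hand-made DataFrame rather than the real page. The player names, values, and the `raw` and `stat_cols` names are made up for illustration. It also assumes the scraped cells arrive as strings (which tends to happen when header rows are mixed into the data), so it converts the stat columns with pd.to_numeric before filling.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the scraped table: one repeated header row
# and two empty cells (NaN), with every value held as a string.
raw = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Player": ["A", "B", "Player", "C"],
        "3P%": [".350", np.nan, "3P%", ".410"],
        "PTS": ["25.1", "18.4", "PTS", np.nan],
    }
)

# Drop the repeated header rows, as in the snippet above.
df = raw[raw["Rk"] != "Rk"].reset_index(drop=True)

# Convert the stat columns from strings to numbers; empty cells stay NaN.
stat_cols = ["Rk", "3P%", "PTS"]
for col in stat_cols:
    df[col] = pd.to_numeric(df[col])

# Fill empty cells with a per-column value by passing a dictionary.
df = df.fillna({"3P%": 0.0, "PTS": 0.0})

print(df)
```

Whether 0 is the right fill value depends on the column: for a percentage with no attempts, leaving the cell as NaN may be more honest than 0.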
Upvotes: 1