Rohit

Reputation: 6020

Cleaning HTML tables with Pandas

I would like to read a table on a website and parse the values. For this I did the following:

import pandas as pd

url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
df = pd.read_html(url, header=0)

Even with header=0, I am still left with a header table, which is df[0], so I do the following:

df1 = df[1]
df1.shape
(161, 11)
df1.columns
Index([u' System ', u' Period (days) ', u' V  B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H]  (dex) ', u' References and notes '], dtype='object')

However, I cannot get

df1.Period

'DataFrame' object has no attribute 'Period'

Nor can I do,

df1.to_csv('junk.csv')

So, how do I access the columns and clean the table? Thanks!

Upvotes: 1

Views: 951

Answers (2)

DSM

Reputation: 353229

It seems to me it's already in a fair enough format:

>>> import pandas as pd
>>> url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
>>> df = pd.read_html(url, header=0)
>>> df1 = df[1]
>>> df1.head()
     System    Period (days)   V  B-V   Spectral type   \
0  V3903 Sgr            1.744      NaT             NaT   
1   V467 Vel            2.753      NaT             NaT   
2     EM Car            3.414      NaT             NaT   
3      Y Cyg            2.996      NaT             NaT   
4   V478 Cyg            2.881      NaT             NaT   

                Mass (Msun )               Radius (Rsun)   \
0  27.27 ± 0.55 19.01 ± 0.44  8.088 ± 0.086 6.125 ± 0.060   
1     25.3 ± 0.7 8.25 ± 0.17      9.99 ± 0.09 3.49 ± 0.03   
2  22.89 ± 0.32 21.43 ± 0.33      9.35 ± 0.17 8.34 ± 0.14   
3  17.57 ± 0.27 17.04 ± 0.26      5.93 ± 0.07 5.78 ± 0.07   
4  16.67 ± 0.45 16.31 ± 0.35  7.423 ± 0.079 7.423 ± 0.079   

        Surface gravity (cgs)                 log Teff (K)   \
0  4.058 ± 0.016 4.143 ± 0.013  4.580 ± 0.021 4.531 ± 0.021   
1  3.842 ± 0.016 4.268 ± 0.017  4.559 ± 0.031 4.402 ± 0.046   
2  3.856 ± 0.017 3.926 ± 0.016  4.531 ± 0.026 4.531 ± 0.026   
3      4.16 ± 0.10 4.18 ± 0.10  4.545 ± 0.007 4.534 ± 0.007   
4  3.919 ± 0.015 3.909 ± 0.013  4.484 ± 0.015 4.485 ± 0.015   

                 log (L/Lsun)   [M/H]  (dex)   \
0  5.087 ± 0.029 4.658 ± 0.032            NaN   
1  5.187 ± 0.126 3.649 ± 0.110            NaN   
2      5.02 ± 0.10 4.92 ± 0.10            NaN   
3                          NaN    0.00 ± 0.00   
4      4.63 ± 0.06 4.63 ± 0.06            NaN   

                               References and notes   
0                   Vaz et al. (1997A&A...327.1094V)  
1             Michalska et al. (2013MNRAS.429.1354M)  
2           Andersen & Clausen (1989A&A...213..183A)  
3       Simon, Sturm & Fiedler (1994A&A...292..507S)  
4  Popper & Hill (1991AJ....101..600P) Popper & E...  

[5 rows x 11 columns]

Since you know how to look at the columns:

>>> df1.columns
Index([u' System ', u' Period (days) ', u' V  B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H]  (dex) ', u' References and notes '], dtype='object')

It shouldn't be surprising that df1.Period doesn't work: none of the columns is called Period, after all, and pandas isn't going to guess which one looks closest. If you want to clean up the column names, you can do something like

>>> df1.columns = [x.strip() for x in df1.columns] # get rid of the leading/trailing spaces
>>> df1 = df1.rename(columns={"Period (days)": "Period"})

after which df1["Period"] (preferred) and df1.Period (shortcut) will work:

>>> df1["Period"].describe()
count    161.000000
mean      32.035019
std       98.392634
min        0.452000
25%        2.293000
50%        3.895000
75%        9.945000
max      771.781000
Name: Period, dtype: float64
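
The measurement columns are still strings, with one cell holding both components (e.g. "27.27 ± 0.55 19.01 ± 0.44"), so if you also want numbers out of those, here is a rough sketch; the helper numbers_in is mine, not something pandas provides, and the column name is the stripped 'Mass (Msun )' from the header above:

>>> import re
>>> def numbers_in(cell):
...     # pull every numeric token out of a "value ± error value ± error" cell
...     if pd.isnull(cell):
...         return []
...     return [float(x) for x in re.findall(r'[-+]?\d*\.?\d+', cell)]
...
>>> masses = df1['Mass (Msun )'].apply(numbers_in)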

"Nor can I do df1.to_csv('junk.csv')" isn't an error report as you don't explain why you can't, or what happens when you do. I'm assuming that you're getting an encoding error:

>>> df1.to_csv("out.csv")
Traceback (most recent call last):
[...]
 File "lib.pyx", line 845, in pandas.lib.write_csv_rows (pandas/lib.c:14261)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb1' in position 6: ordinal not in range(128)

which can be avoided if you specify an appropriate encoding:

>>> df1.to_csv("out.csv", encoding="utf8")
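
For what it's worth, reading it back with the same encoding should round-trip cleanly (just a quick check on the out.csv written above):

>>> df_back = pd.read_csv("out.csv", index_col=0, encoding="utf8")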

Upvotes: 1

behzad.nouri

Reputation: 77971

the column name is parsed as u' Period (days) ', so to access the column:

>>> df1[ u' Period (days) ' ]

that said, you may want to use an HTML parsing library for this type of job; for example, BeautifulSoup can do it very neatly:

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen

>>> url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
>>> html = urlopen(url).read()
>>> soup = BeautifulSoup(html)

>>> # catch the target table by its attributes
>>> table = soup.find('table', attrs={'frame':'BOX', 'rules':'ALL'})

>>> # parse the table as a list of lists; each row as a single list
>>> tbl = [[td.getText() for td in tr.findAll(['td', 'th'])] for tr in table.findAll('tr')]

the tbl at the end is the target table as a list of lists, i.e. each row is a list of the values of the cells in that row; for example tbl[0] is simply the header:

>>> tbl[0]
[u' System ', u' Period (days) ', u' V  B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H]  (dex) ', u' References and notes ']
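
If you then want this back as a DataFrame rather than plain lists, one way (a sketch, assuming the first row tbl[0] is the header and the remaining rows are data) is:

>>> import pandas as pd
>>> df1 = pd.DataFrame(tbl[1:], columns=tbl[0])
>>> df1.columns = [c.strip() for c in df1.columns]  # tidy the whitespace around the names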

Upvotes: 0
