Rohit

Reputation: 6020

Cleaning HTML tables with Pandas

I would like to read a table on a website and parse the values. For this I did the following:

import pandas as pd

url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
df = pd.read_html(url, header=0)

Even with header=0, I am still left with a header table, which is df[0], so I do the following:

df1 = df[1]
df1.shape
(161, 11)
df1.columns
Index([u' System ', u' Period (days) ', u' V  B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H]  (dex) ', u' References and notes '], dtype='object')

However, I cannot get

df1.Period

'DataFrame' object has no attribute 'Period'

Nor can I do,

df1.to_csv('junk.csv')

So, how do I access the columns and clean the table? Thanks!

Upvotes: 1

Views: 951

Answers (2)

DSM

Reputation: 353229

It seems to me it's already in a fair enough format:

>>> import pandas as pd
>>> url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
>>> df = pd.read_html(url, header=0)
>>> df1 = df[1]
>>> df1.head()
     System    Period (days)   V  B-V   Spectral type   \
0  V3903 Sgr            1.744      NaT             NaT   
1   V467 Vel            2.753      NaT             NaT   
2     EM Car            3.414      NaT             NaT   
3      Y Cyg            2.996      NaT             NaT   
4   V478 Cyg            2.881      NaT             NaT   

                Mass (Msun )               Radius (Rsun)   \
0  27.27 ± 0.55 19.01 ± 0.44  8.088 ± 0.086 6.125 ± 0.060   
1     25.3 ± 0.7 8.25 ± 0.17      9.99 ± 0.09 3.49 ± 0.03   
2  22.89 ± 0.32 21.43 ± 0.33      9.35 ± 0.17 8.34 ± 0.14   
3  17.57 ± 0.27 17.04 ± 0.26      5.93 ± 0.07 5.78 ± 0.07   
4  16.67 ± 0.45 16.31 ± 0.35  7.423 ± 0.079 7.423 ± 0.079   

        Surface gravity (cgs)                 log Teff (K)   \
0  4.058 ± 0.016 4.143 ± 0.013  4.580 ± 0.021 4.531 ± 0.021   
1  3.842 ± 0.016 4.268 ± 0.017  4.559 ± 0.031 4.402 ± 0.046   
2  3.856 ± 0.017 3.926 ± 0.016  4.531 ± 0.026 4.531 ± 0.026   
3      4.16 ± 0.10 4.18 ± 0.10  4.545 ± 0.007 4.534 ± 0.007   
4  3.919 ± 0.015 3.909 ± 0.013  4.484 ± 0.015 4.485 ± 0.015   

                 log (L/Lsun)   [M/H]  (dex)   \
0  5.087 ± 0.029 4.658 ± 0.032            NaN   
1  5.187 ± 0.126 3.649 ± 0.110            NaN   
2      5.02 ± 0.10 4.92 ± 0.10            NaN   
3                          NaN    0.00 ± 0.00   
4      4.63 ± 0.06 4.63 ± 0.06            NaN   

                               References and notes   
0                   Vaz et al. (1997A&A...327.1094V)  
1             Michalska et al. (2013MNRAS.429.1354M)  
2           Andersen & Clausen (1989A&A...213..183A)  
3       Simon, Sturm & Fiedler (1994A&A...292..507S)  
4  Popper & Hill (1991AJ....101..600P) Popper & E...  

[5 rows x 11 columns]

Since you know how to look at the columns:

>>> df1.columns
Index([u' System ', u' Period (days) ', u' V  B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H]  (dex) ', u' References and notes '], dtype='object')

It shouldn't be surprising that df1.Period doesn't work: none of the columns is called Period, after all, and pandas isn't going to guess which one looks closest. If you want to clean up the column names, you can do something like

>>> df1.columns = [x.strip() for x in df1.columns] # get rid of the leading/trailing spaces
>>> df1 = df1.rename(columns={"Period (days)": "Period"})

after which df1["Period"] (preferred) and df1.Period (shortcut) will work:

>>> df1["Period"].describe()
count    161.000000
mean      32.035019
std       98.392634
min        0.452000
25%        2.293000
50%        3.895000
75%        9.945000
max      771.781000
Name: Period, dtype: float64
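
The measurement columns are still strings, with one cell holding both components (e.g. "27.27 ± 0.55 19.01 ± 0.44"), so if you also want numbers out of those, here is a rough sketch; the helper numbers_in is mine, not something pandas provides, and the column name is the stripped 'Mass (Msun )' from the header above:

>>> import re
>>> def numbers_in(cell):
...     # pull every numeric token out of a "value ± error value ± error" cell
...     if pd.isnull(cell):
...         return []
...     return [float(x) for x in re.findall(r'[-+]?\d*\.?\d+', cell)]
...
>>> masses = df1['Mass (Msun )'].apply(numbers_in)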

"Nor can I do df1.to_csv('junk.csv')" isn't an error report as you don't explain why you can't, or what happens when you do. I'm assuming that you're getting an encoding error:

>>> df1.to_csv("out.csv")
Traceback (most recent call last):
[...]
 File "lib.pyx", line 845, in pandas.lib.write_csv_rows (pandas/lib.c:14261)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb1' in position 6: ordinal not in range(128)

which can be avoided if you specify an appropriate encoding:

>>> df1.to_csv("out.csv", encoding="utf8")
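
For what it's worth, reading it back with the same encoding should round-trip cleanly (just a quick check on the out.csv written above):

>>> df_back = pd.read_csv("out.csv", index_col=0, encoding="utf8")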

Upvotes: 1

behzad.nouri

Reputation: 77971

the column name is parsed as u' Period (days) ', so to access the column:

>>> df1[ u' Period (days) ' ]

that said, you may want to use an HTML parsing library for this type of job; for example, BeautifulSoup can do it very neatly:

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen

>>> url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
>>> html = urlopen(url).read()
>>> soup = BeautifulSoup(html)

>>> # catch the target table by its attributes
>>> table = soup.find('table', attrs={'frame':'BOX', 'rules':'ALL'})

>>> # parse the table as a list of lists; each row as a single list
>>> tbl = [[td.getText() for td in tr.findAll(['td', 'th'])] for tr in table.findAll('tr')]

the tbl at the end is the target table as a list of lists, i.e. each row is a list of the values of the cells in that row; for example tbl[0] is simply the header:

>>> tbl[0]
[u' System ', u' Period (days) ', u' V  B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H]  (dex) ', u' References and notes ']
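
If you then want this back as a DataFrame rather than plain lists, one way (a sketch, assuming the first row tbl[0] is the header and the remaining rows are data) is:

>>> import pandas as pd
>>> df1 = pd.DataFrame(tbl[1:], columns=tbl[0])
>>> df1.columns = [c.strip() for c in df1.columns]  # tidy the whitespace around the names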

Upvotes: 0
