Reputation: 6020
I would like to read a table on a website and parse the values. For this I did the following:
url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
df = pd.read_html(url, header=0)
Even with header=0, I am still left with a header with is df[0], so I do the following:
df = df[1]
df1.shape
(161, 11)
df1.columns
Index([u' System ', u' Period (days) ', u' V B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H] (dex) ', u' References and notes '], dtype='object')
However, I cannot get
df1.Period
'DataFrame' object has no attribute 'period'
Nor can I do,
df1.to_csv('junk.csv')
So, how do I access the columns and clean the table? Thanks!
Upvotes: 1
Views: 951
Reputation: 353229
ISTM like it's already in a fair enough format:
>>> url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
>>> df = pd.read_html(url, header=0)
>>> df1 = df[1]
>>> df1.head()
System Period (days) V B-V Spectral type \
0 V3903 Sgr 1.744 NaT NaT
1 V467 Vel 2.753 NaT NaT
2 EM Car 3.414 NaT NaT
3 Y Cyg 2.996 NaT NaT
4 V478 Cyg 2.881 NaT NaT
Mass (Msun ) Radius (Rsun) \
0 27.27 ± 0.55 19.01 ± 0.44 8.088 ± 0.086 6.125 ± 0.060
1 25.3 ± 0.7 8.25 ± 0.17 9.99 ± 0.09 3.49 ± 0.03
2 22.89 ± 0.32 21.43 ± 0.33 9.35 ± 0.17 8.34 ± 0.14
3 17.57 ± 0.27 17.04 ± 0.26 5.93 ± 0.07 5.78 ± 0.07
4 16.67 ± 0.45 16.31 ± 0.35 7.423 ± 0.079 7.423 ± 0.079
Surface gravity (cgs) log Teff (K) \
0 4.058 ± 0.016 4.143 ± 0.013 4.580 ± 0.021 4.531 ± 0.021
1 3.842 ± 0.016 4.268 ± 0.017 4.559 ± 0.031 4.402 ± 0.046
2 3.856 ± 0.017 3.926 ± 0.016 4.531 ± 0.026 4.531 ± 0.026
3 4.16 ± 0.10 4.18 ± 0.10 4.545 ± 0.007 4.534 ± 0.007
4 3.919 ± 0.015 3.909 ± 0.013 4.484 ± 0.015 4.485 ± 0.015
log (L/Lsun) [M/H] (dex) \
0 5.087 ± 0.029 4.658 ± 0.032 NaN
1 5.187 ± 0.126 3.649 ± 0.110 NaN
2 5.02 ± 0.10 4.92 ± 0.10 NaN
3 NaN 0.00 ± 0.00
4 4.63 ± 0.06 4.63 ± 0.06 NaN
References and notes
0 Vaz et al. (1997A&A...327.1094V)
1 Michalska et al. (2013MNRAS.429.1354M)
2 Andersen & Clausen (1989A&A...213..183A)
3 Simon, Sturm & Fiedler (1994A&A...292..507S)
4 Popper & Hill (1991AJ....101..600P) Popper & E...
[5 rows x 11 columns]
Since you know how to look at the columns:
>>> df1.columns
Index([u' System ', u' Period (days) ', u' V B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H] (dex) ', u' References and notes '], dtype='object')
It shouldn't be surprising that df.Period
doesn't work-- none of the columns are called Period
, after all. Pandas isn't going to randomly guess which one looks closest. If you want to process the column names, you can do something like
>>> df1.columns = [x.strip() for x in df1.columns] # get rid of the leading/trailing spaces
>>> df1 = df1.rename(columns={"Period (days)": "Period"})
after which df1["Period"]
(preferred) and df1.Period
(shortcut) will work:
>>> df1["Period"].describe()
count 161.000000
mean 32.035019
std 98.392634
min 0.452000
25% 2.293000
50% 3.895000
75% 9.945000
max 771.781000
Name: Period, dtype: float64
"Nor can I do df1.to_csv('junk.csv')
" isn't an error report as you don't explain why you can't, or what happens when you do. I'm assuming that you're getting an encoding error:
>>> df1.to_csv("out.csv")
Traceback (most recent call last):
[...]
File "lib.pyx", line 845, in pandas.lib.write_csv_rows (pandas/lib.c:14261)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb1' in position 6: ordinal not in range(128)
which can be avoided if you specify an appropriate encoding:
>>> df1.to_csv("out.csv", encoding="utf8")
Upvotes: 1
Reputation: 77971
the columns name is parsed a u' Period (days) '
, so to access the column:
>>> df1[ u' Period (days) ' ]
that said, you need to use a html parsing libraries for this type of job; for example BeautifulSoup
can do it very neatly;
>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen
>>> url = 'http://www.astro.keele.ac.uk/jkt/debcat/'
>>> html = urlopen(url).read()
>>> soup = BeautifulSoup(html)
>>> # catch the target table by its attributes
>>> table = soup.find('table', attrs={'frame':'BOX', 'rules':'ALL'})
>>> # parse the table as a list of lists; each row as a single list
>>> tbl = [[td.getText() for td in tr.findAll(['td', 'th'])] for tr in table.findAll('tr')]
the tbl
at the end is the target table as a list of list; ie. each row is a list of the values of the cells in that row; for example tbl[0]
is simply the header:
>>> tbl[0]
[u' System ', u' Period (days) ', u' V B-V ', u' Spectral type ', u' Mass (Msun )', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H] (dex) ', u' References and notes ']
Upvotes: 0