Reputation: 4318
I have a huge set of catalogues which have different columns and the different header names for each column, where the description for each header name is given as comments at the beginning of my ascii files in a row. What is the best way to read them into a pandas.DataFrame
while it can set the name of the column as well without being needed to define it from the beginning. The following is an example of my catalogues:
# 1 MAG_AUTO Kron-like elliptical aperture magnitude [mag]
# 2 rh half light radius (analyse) [pixel]
# 3 MU_MAX Peak surface brightness above background [mag * arcsec**(-2)]
# 4 FWHM_IMAGE FWHM assuming a gaussian core [pixel]
# 5 CLASS_STAR S/G classifier output
18.7462 4.81509 20.1348 6.67273 0.0286538
18.2440 7.17988 20.6454 21.6235 0.0286293
18.3102 3.11273 19.0960 8.26081 0.0430532
21.1751 2.92533 21.9931 5.52080 0.0290418
19.3998 1.86182 19.3166 3.42346 0.986598
20.0801 3.52828 21.3484 6.76799 0.0303842
21.9427 2.08458 22.0577 5.59344 0.981466
20.7726 1.86017 20.8130 3.69570 0.996121
23.0836 2.23427 23.3689 4.49985 0.706207
23.2443 1.62021 23.1089 3.54191 0.973419
20.6343 3.99555 21.9426 6.94700 0.0286164
23.4012 2.00408 23.3412 4.35926 0.946349
23.8427 1.54819 23.8241 3.83407 0.897079
20.3344 2.69910 20.9401 4.38988 0.0355277
21.7506 2.43451 22.2115 4.62045 0.0786921
Upvotes: 4
Views: 10182
Reputation: 2542
This is a file in Sextractor format. The astropy.io.ascii
reader understands this format natively so this is a snap to read:
>>> from astropy.io import ascii
>>> dat = ascii.read('table.dat')
>>> dat
<Table masked=False length=3>
MAG_AUTO rh MU_MAX FWHM_IMAGE CLASS_STAR
mag mag / arcsec2 pix
float64 float64 float64 float64 float64
-------- ------- ------------- ---------- ----------
18.7462 4.81509 20.1348 6.67273 0.0286538
18.244 7.17988 20.6454 21.6235 0.0286293
18.3102 3.11273 19.096 8.26081 0.0430532
...
Note that using the astropy ASCII reader you get a table that also retains the unit meta data.
If you still want to convert this to a pandas dataframe that's easy as well with DataFrame(dat.as_array())
. Version 1.1 of astropy (and the current master) will have methods to_pandas
and from_pandas
that make this conversion more robust (see http://astropy.readthedocs.org/en/latest/table/pandas.html).
Upvotes: 10
Reputation: 68186
Ok, assuming all of your header info is encoded in the exact same way, here's how I would do this:
import re
import pandas
COMMENT_CHAR = '#'
columns = []
with open('test.dat', 'r') as td:
for line in td:
# find the commented lines
if line[0] == COMMENT_CHAR:
info = re.split(' +', line)
columns.append(info[2])
# when we seethe first line that doesn't start with
# COMMENT_CHAR, we pass the remaining lines of the
# file to pandas.read_table and break our loop
else:
_dfs = [
pandas.DataFrame([line.split(' ')], columns=columns, dtype=float),
pandas.read_table(td, sep='\s', header=None, names=columns)
]
df = pandas.concat(_dfs, ignore_index=True)
To break down the initial parsing a bit, re.split(' +', line)
turns this:
# 1 MAG_AUTO Kron-like elliptical aperture magnitude [mag]
into
['#', '1', 'MAG_AUTO', 'Kron-like', 'elliptical', 'aperture', 'magnitude', '[mag]']
So we take the column name as the 3 element (index = 2).
All this produces a dataframe that looks like this:
print(df.head())
MAG_AUTO rh MU_MAX FWHM_IMAGE CLASS_STAR
0 18.7462 4.81509 20.1348 6.67273 0.0286538
1 18.2440 7.17988 20.6454 21.62350 0.028629
2 18.3102 3.11273 19.0960 8.26081 0.043053
3 21.1751 2.92533 21.9931 5.52080 0.029042
4 19.3998 1.86182 19.3166 3.42346 0.986598
And df.info()
gives us:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 5 columns):
MAG_AUTO 15 non-null float64
rh 15 non-null float64
MU_MAX 15 non-null float64
FWHM_IMAGE 15 non-null float64
CLASS_STAR 15 non-null float64
dtypes: float64(5)
memory usage: 720.0 bytes
Upvotes: 2