Reputation: 18614
I want to load some data into python pandas for manipulation as simply as possible. The data is in the following format which I do not control:
# file generated on Sun Feb 8 23:42:56 2015
# structure, X2, Rg, filename
1 1.100341 83.720000 run2/crysol//run2_00001 0.840187
2 1.048154 83.730000 run2/crysol//run2_00002 0.788968
3 1.056927 83.730000 run2/crysol//run2_00003 0.793954
4 1.481612 84.040000 run2/crysol//run2_00004 0.786256
I found a way to read the data in using read_table but was only able to extract the column names by opening the text file and reading the line separately using the following:
In [1]: data = pd.read_table('data.txt', sep='\t', skiprows=2, header=None)
Out[1]:
0 1 2 3 4
0 1 1.100341 83.72 run2/crysol//run2_00001 0.840187
1 2 1.048154 83.73 run2/crysol//run2_00002 0.788968
2 3 1.056927 83.73 run2/crysol//run2_00003 0.793954
3 4 1.481612 84.04 run2/crysol//run2_00004 0.786256
In [2]: header = open('data.txt', 'r').readlines()[1].lstrip('#').strip().split(', ')
In [3]: header.append('unknown')
In [4]: data.columns = header
In [5]: print data
structure X2 Rg filename unknown
0 1 1.100341 83.72 run2/crysol//run2_00001 0.840187
1 2 1.048154 83.73 run2/crysol//run2_00002 0.788968
2 3 1.056927 83.73 run2/crysol//run2_00003 0.793954
3 4 1.481612 84.04 run2/crysol//run2_00004 0.786256
4 5 1.634436 84.03 run2/crysol//run2_00005 0.935468
It realize it complicate things that the column names are comma separated while the data is tab separated. Further, I am not sure why one of the column labels is left out of the data header.
Is there a way to do this more simply using pandas?
Upvotes: 1
Views: 299
Reputation: 16309
While it's probably possible to tell the parser
how to deal with a comma-delimited header, here's a slight performance improvement to your solution. Read only the first two lines of the file:
data = pd.read_table('data.txt', sep='\t', skiprows=2, header=None)
#read first two lines
with open('data.txt', 'r') as f:
_, line2 = f.readline(), f.readline()
Since readline()
just moves a pointer line by line, this is more efficient than reading all of the lines.
data.columns = line2.lstrip('#').strip().split(', ')
Which column label is being left out of the data?
Upvotes: 1