Steven C. Howell
Steven C. Howell

Reputation: 18614

Read commented column names into pandas

I want to load some data into python pandas for manipulation as simply as possible. The data is in the following format which I do not control:

# file generated on Sun Feb  8 23:42:56 2015
# structure, X2, Rg, filename
1       1.100341        83.720000       run2/crysol//run2_00001 0.840187
2       1.048154        83.730000       run2/crysol//run2_00002 0.788968
3       1.056927        83.730000       run2/crysol//run2_00003 0.793954
4       1.481612        84.040000       run2/crysol//run2_00004 0.786256

I found a way to read the data in using read_table but was only able to extract the column names by opening the text file and reading the line separately using the following:

In [1]: data = pd.read_table('data.txt', sep='\t', skiprows=2, header=None)
Out[1]:
     0          1      2                        3         4
0    1   1.100341  83.72  run2/crysol//run2_00001  0.840187
1    2   1.048154  83.73  run2/crysol//run2_00002  0.788968
2    3   1.056927  83.73  run2/crysol//run2_00003  0.793954
3    4   1.481612  84.04  run2/crysol//run2_00004  0.786256
In [2]: header = open('data.txt', 'r').readlines()[1].lstrip('#').strip().split(', ')
In [3]: header.append('unknown')
In [4]: data.columns = header
In [5]: print data
    structure         X2     Rg                 filename   unknown
0           1   1.100341  83.72  run2/crysol//run2_00001  0.840187
1           2   1.048154  83.73  run2/crysol//run2_00002  0.788968
2           3   1.056927  83.73  run2/crysol//run2_00003  0.793954
3           4   1.481612  84.04  run2/crysol//run2_00004  0.786256
4           5   1.634436  84.03  run2/crysol//run2_00005  0.935468

It realize it complicate things that the column names are comma separated while the data is tab separated. Further, I am not sure why one of the column labels is left out of the data header.

Is there a way to do this more simply using pandas?

Upvotes: 1

Views: 299

Answers (1)

Adam Hughes
Adam Hughes

Reputation: 16309

While it's probably possible to tell the parser how to deal with a comma-delimited header, here's a slight performance improvement to your solution. Read only the first two lines of the file:

data = pd.read_table('data.txt', sep='\t', skiprows=2, header=None)
#read first two lines
with open('data.txt', 'r') as f:
    _, line2 = f.readline(), f.readline()

Since readline() just moves a pointer line by line, this is more efficient than reading all of the lines.

data.columns = line2.lstrip('#').strip().split(', ')

Which column label is being left out of the data?

Upvotes: 1

Related Questions