thewyliestcoyote
thewyliestcoyote

Reputation: 13

Reading CSV's with Pandas and dealing with comments

This is a example of the data files I am trying to read in using Pandas. All the files have different number of comment lines but all start the data section with the BEGIN and end with END and maybe a newline after that.

!Example data file
!With commands delimited by exclamation points
!Not always the some number of comment lines
BEGIN
300,-1.0342501,-0.07359
5298,-0.9889674,0.06514
1029,-0.981307,0.130398
1529,-0.971765,0.1945281
END

This is my Pandas for reading in these files.

b = pd.read_csv(data_file,,names=['Frequency','Real','Imaginary'],comment='!') 

There are 2 problems I am having, first is it reads all the lines and just fills the comment lines will just be None or NaN and reads the BEGIN and END tags. This also results offset in the index of the cells, this is my second problem.

What would be the correct Pandas to read this into a data frame with removing the comment lines and the BEGIN and END tags? Is there a elegant one line of code that can solve both of my problems?

Upvotes: 1

Views: 1270

Answers (2)

BKay
BKay

Reputation: 1457

How about importing the whole file and dropping everything where the second field is blank?

import pandas as pd
import numpy as np    
b = pd.read_csv('sample2.csv',names=['Frequency','Real','Imaginary'],comment='!')
isnotnan = lambda x: not(np.isnan(x))
b2 = b[b['Real'].apply(isnotnan)]

Results b:

  Frequency      Real  Imaginary
0       NaN       NaN        NaN
1       NaN       NaN        NaN
2       NaN       NaN        NaN
3     BEGIN       NaN        NaN
4       300 -1.034250  -0.073590
5      5298 -0.988967   0.065140
6      1029 -0.981307   0.130398
7      1529 -0.971765   0.194528
8       END       NaN        NaN

Results b2:

  Frequency      Real  Imaginary
4       300 -1.034250  -0.073590
5      5298 -0.988967   0.065140
6      1029 -0.981307   0.130398
7      1529 -0.971765   0.194528

To reset the index:

b3 = b2.reset_index(drop = True)

Output of b3:

  Frequency      Real  Imaginary
0       300 -1.034250  -0.073590
1      5298 -0.988967   0.065140
2      1029 -0.981307   0.130398
3      1529 -0.971765   0.194528

Upvotes: 2

Warren Weckesser
Warren Weckesser

Reputation: 114841

Here's a variation of your code:

In [125]: df = pd.read_csv('data_file.csv', comment='!', header=0, names=['Frequency','Real','Imaginary'], na_values=['END'])

In [126]: df
Out[126]: 
   Frequency      Real  Imaginary
0        300 -1.034250  -0.073590
1       5298 -0.988967   0.065140
2       1029 -0.981307   0.130398
3       1529 -0.971765   0.194528
4        NaN       NaN        NaN

The 'END' in the last line is converted to NaN, so we'll remove the last row:

In [127]: df = df.iloc[:-1]    # or `df = df.dropna()`

In [128]: df
Out[128]: 
   Frequency      Real  Imaginary
0        300 -1.034250  -0.073590
1       5298 -0.988967   0.065140
2       1029 -0.981307   0.130398
3       1529 -0.971765   0.194528

Upvotes: 1

Related Questions