Reputation: 13
This is a example of the data files I am trying to read in using Pandas. All the files have different number of comment lines but all start the data section with the BEGIN
and end with END
and maybe a newline after that.
!Example data file
!With commands delimited by exclamation points
!Not always the some number of comment lines
BEGIN
300,-1.0342501,-0.07359
5298,-0.9889674,0.06514
1029,-0.981307,0.130398
1529,-0.971765,0.1945281
END
This is my Pandas for reading in these files.
b = pd.read_csv(data_file,,names=['Frequency','Real','Imaginary'],comment='!')
There are 2 problems I am having, first is it reads all the lines and just fills the comment lines will just be None or NaN and reads the BEGIN
and END
tags. This also results offset in the index of the cells, this is my second problem.
What would be the correct Pandas to read this into a data frame with removing the comment lines and the BEGIN
and END
tags? Is there a elegant one line of code that can solve both of my problems?
Upvotes: 1
Views: 1270
Reputation: 1457
How about importing the whole file and dropping everything where the second field is blank?
import pandas as pd
import numpy as np
b = pd.read_csv('sample2.csv',names=['Frequency','Real','Imaginary'],comment='!')
isnotnan = lambda x: not(np.isnan(x))
b2 = b[b['Real'].apply(isnotnan)]
Results b:
Frequency Real Imaginary
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 BEGIN NaN NaN
4 300 -1.034250 -0.073590
5 5298 -0.988967 0.065140
6 1029 -0.981307 0.130398
7 1529 -0.971765 0.194528
8 END NaN NaN
Results b2:
Frequency Real Imaginary
4 300 -1.034250 -0.073590
5 5298 -0.988967 0.065140
6 1029 -0.981307 0.130398
7 1529 -0.971765 0.194528
To reset the index:
b3 = b2.reset_index(drop = True)
Output of b3:
Frequency Real Imaginary
0 300 -1.034250 -0.073590
1 5298 -0.988967 0.065140
2 1029 -0.981307 0.130398
3 1529 -0.971765 0.194528
Upvotes: 2
Reputation: 114841
Here's a variation of your code:
In [125]: df = pd.read_csv('data_file.csv', comment='!', header=0, names=['Frequency','Real','Imaginary'], na_values=['END'])
In [126]: df
Out[126]:
Frequency Real Imaginary
0 300 -1.034250 -0.073590
1 5298 -0.988967 0.065140
2 1029 -0.981307 0.130398
3 1529 -0.971765 0.194528
4 NaN NaN NaN
The 'END' in the last line is converted to NaN, so we'll remove the last row:
In [127]: df = df.iloc[:-1] # or `df = df.dropna()`
In [128]: df
Out[128]:
Frequency Real Imaginary
0 300 -1.034250 -0.073590
1 5298 -0.988967 0.065140
2 1029 -0.981307 0.130398
3 1529 -0.971765 0.194528
Upvotes: 1