Reputation: 631
I have df with 300-ish rows that are not always equally distributed. They look like this:
Lags Rep 1 Rep 2 Rep 3
12.500000000E-9 7671.039418 6605.763724 10144.873125
25.000000000E-9 -1.000000 -0.479659 1.454251
37.500000000E-9 31.978402 23.456005 29.678136
50.000000000E-9 5.315013 4.723746 0.227125
62.500000000E-9 1.806673 2.642384 2.681376
75.000000000E-9 NaN NaN NaN
83.500000000E-9 NaN NaN NaN
Time PhtA count 1 PhtA count 2 PhtA count 3
0.000000000E+0 42.743683 10.890961 12.454987
2.428800000E-3 14.533997 8.125305 7.534027
4.857600000E-3 8.621216 7.686615 7.133484
7.286400000E-3 5.779266 10.147095 6.561279
9.715200000E-3 6.046295 8.201599 5.187988
12.144000000E-3 5.226135 7.343292 5.855560
Time PhtB count 1 PhtB count 2 PhtB count 3
0.860800000E-3 12.626648 13.580322 8.220673
1.289600000E-3 10.814667 21.381378 7.038116
2.718400000E-3 7.915497 17.261505 7.648468
3.147200000E-3 9.403229 21.266937 10.013580
When split, it would be nice to have 3 dfs like this:
First df:
Lags Rep 1 Rep 2 Rep 3
12.500000000E-9 7671.039418 6605.763724 10144.873125
25.000000000E-9 -1.000000 -0.479659 1.454251
37.500000000E-9 31.978402 23.456005 29.678136
50.000000000E-9 5.315013 4.723746 0.227125
62.500000000E-9 1.806673 2.642384 2.681376
Second df:
Time PhtA count 1 PhtA count 2 PhtA count 3
0.000000000E+0 42.743683 10.890961 12.454987
2.428800000E-3 14.533997 8.125305 7.534027
4.857600000E-3 8.621216 7.686615 7.133484
7.286400000E-3 5.779266 10.147095 6.561279
9.715200000E-3 6.046295 8.201599 5.187988
12.144000000E-3 5.226135 7.343292 5.855560
Third df
Time PhtB count 1 PhtB count 2 PhtB count 3
0.860800000E-3 12.626648 13.580322 8.220673
1.289600000E-3 10.814667 21.381378 7.038116
2.718400000E-3 7.915497 17.261505 7.648468
3.147200000E-3 9.403229 21.266937 10.013580
The length of the three chunks is not always the same, that's why I'm asking for help to solve this programatically. A couple of details about the first df I can say is that:
the first chunk always end up with a bunch of lines with values NaN (in the example are only two)
There are two more chunks that start with named column headers (Time, PhtA count 1, PhtA count 2, ...)
The last two chunks do not have any NaN values
All chunks have variable number of rows, although the headers are always the same
There is always an empty line separating the chunks
Any help will be appreciated.
Thanks in advance.
Upvotes: 0
Views: 744
Reputation: 30589
First read all data into a df preserving empty lines and then split it at these empty lines and convert to numeric:
df = pd.read_csv('data.csv', sep='\s{2,}', skip_blank_lines=False, engine='python')
x = df[df.Lags.isnull()==True].index.values
df1 = df[0:x[0]].dropna().apply(pd.to_numeric)
df2 = df[x[0]+2:x[1]].apply(pd.to_numeric)
df2.columns=df.iloc[x[0]+1].values
df3 = df[x[1]+2:].apply(pd.to_numeric)
df3.columns = df.iloc[x[1]+1].values
Output of print(df1);print(df2); print(df3)
:
Lags Rep 1 Rep 2 Rep 3
0 1.250000e-08 7671.039418 6605.763724 10144.873125
1 2.500000e-08 -1.000000 -0.479659 1.454251
2 3.750000e-08 31.978402 23.456005 29.678136
3 5.000000e-08 5.315013 4.723746 0.227125
4 6.250000e-08 1.806673 2.642384 2.681376
Time PhtA count 1 PhtA count 2 PhtA count 3
9 0.000000 42.743683 10.890961 12.454987
10 0.002429 14.533997 8.125305 7.534027
11 0.004858 8.621216 7.686615 7.133484
12 0.007286 5.779266 10.147095 6.561279
13 0.009715 6.046295 8.201599 5.187988
14 0.012144 5.226135 7.343292 5.855560
Time PhtB count 1 PhtB count 2 PhtB count 3
17 0.000861 12.626648 13.580322 8.220673
18 0.001290 10.814667 21.381378 7.038116
19 0.002718 7.915497 17.261505 7.648468
20 0.003147 9.403229 21.266937 10.013580
df = pd.read_csv('data.csv', sep='\s{2,}', skip_blank_lines=False, engine='python', header=None)
x = [-1] + list(df[df.iloc[:,0].isnull()==True].index.values) + [len(df)]
for i in range(1,len(x)):
globals()[f'df{i}'] = df[x[i-1]+2:x[i]].dropna().apply(pd.to_numeric)
globals()[f'df{i}'].columns = df.iloc[x[i-1]+1].values
Upvotes: 2