Split dataframe dynamically

Question

I have df with 300-ish rows that are not always equally distributed. They look like this:

Lags     Rep 1      Rep 2     Rep 3 
12.500000000E-9     7671.039418     6605.763724     10144.873125
25.000000000E-9     -1.000000   -0.479659   1.454251
37.500000000E-9     31.978402   23.456005   29.678136
50.000000000E-9     5.315013    4.723746    0.227125
62.500000000E-9     1.806673    2.642384    2.681376
75.000000000E-9     NaN     NaN     NaN
83.500000000E-9     NaN     NaN     NaN

Time    PhtA count 1     PhtA count 2     PhtA count 3
0.000000000E+0  42.743683   10.890961   12.454987
2.428800000E-3  14.533997   8.125305    7.534027
4.857600000E-3  8.621216    7.686615    7.133484
7.286400000E-3  5.779266    10.147095   6.561279
9.715200000E-3  6.046295    8.201599    5.187988
12.144000000E-3     5.226135    7.343292    5.855560

Time    PhtB count 1     PhtB count 2     PhtB count 3
0.860800000E-3  12.626648   13.580322   8.220673
1.289600000E-3  10.814667   21.381378   7.038116
2.718400000E-3  7.915497    17.261505   7.648468
3.147200000E-3  9.403229    21.266937   10.013580

When split, it would be nice to have 3 dfs like this:

First df:

Lags     Rep 1      Rep 2     Rep 3 
12.500000000E-9     7671.039418     6605.763724     10144.873125
25.000000000E-9     -1.000000   -0.479659   1.454251
37.500000000E-9     31.978402   23.456005   29.678136
50.000000000E-9     5.315013    4.723746    0.227125
62.500000000E-9     1.806673    2.642384    2.681376

Second df:

Time    PhtA count 1     PhtA count 2     PhtA count 3
0.000000000E+0  42.743683   10.890961   12.454987
2.428800000E-3  14.533997   8.125305    7.534027
4.857600000E-3  8.621216    7.686615    7.133484
7.286400000E-3  5.779266    10.147095   6.561279
9.715200000E-3  6.046295    8.201599    5.187988
12.144000000E-3     5.226135    7.343292    5.855560

Third df

Time    PhtB count 1     PhtB count 2     PhtB count 3
0.860800000E-3  12.626648   13.580322   8.220673
1.289600000E-3  10.814667   21.381378   7.038116
2.718400000E-3  7.915497    17.261505   7.648468
3.147200000E-3  9.403229    21.266937   10.013580

The length of the three chunks is not always the same, that's why I'm asking for help to solve this programatically. A couple of details about the first df I can say is that:

the first chunk always end up with a bunch of lines with values NaN (in the example are only two)
There are two more chunks that start with named column headers (Time, PhtA count 1, PhtA count 2, ...)
The last two chunks do not have any NaN values
All chunks have variable number of rows, although the headers are always the same
There is always an empty line separating the chunks

Any help will be appreciated.

Thanks in advance.

Stef · Accepted Answer

First read all data into a df preserving empty lines and then split it at these empty lines and convert to numeric:

df = pd.read_csv('data.csv', sep='\s{2,}', skip_blank_lines=False, engine='python')
x = df[df.Lags.isnull()==True].index.values

df1 = df[0:x[0]].dropna().apply(pd.to_numeric)

df2 = df[x[0]+2:x[1]].apply(pd.to_numeric)
df2.columns=df.iloc[x[0]+1].values

df3 = df[x[1]+2:].apply(pd.to_numeric)
df3.columns = df.iloc[x[1]+1].values

Output of print(df1);print(df2); print(df3):

           Lags        Rep 1        Rep 2         Rep 3
0  1.250000e-08  7671.039418  6605.763724  10144.873125
1  2.500000e-08    -1.000000    -0.479659      1.454251
2  3.750000e-08    31.978402    23.456005     29.678136
3  5.000000e-08     5.315013     4.723746      0.227125
4  6.250000e-08     1.806673     2.642384      2.681376
        Time  PhtA count 1  PhtA count 2  PhtA count 3
9   0.000000     42.743683     10.890961     12.454987
10  0.002429     14.533997      8.125305      7.534027
11  0.004858      8.621216      7.686615      7.133484
12  0.007286      5.779266     10.147095      6.561279
13  0.009715      6.046295      8.201599      5.187988
14  0.012144      5.226135      7.343292      5.855560
        Time  PhtB count 1  PhtB count 2  PhtB count 3
17  0.000861     12.626648     13.580322      8.220673
18  0.001290     10.814667     21.381378      7.038116
19  0.002718      7.915497     17.261505      7.648468
20  0.003147      9.403229     21.266937     10.013580

Bonus: Generic solution for an arbitrary number of data blocks in the csv, devided by empty lines (their number need not to be known in advance):

df = pd.read_csv('data.csv', sep='\s{2,}', skip_blank_lines=False, engine='python', header=None)
x = [-1] + list(df[df.iloc[:,0].isnull()==True].index.values) + [len(df)]
for i in range(1,len(x)):
     globals()[f'df{i}'] = df[x[i-1]+2:x[i]].dropna().apply(pd.to_numeric)
     globals()[f'df{i}'].columns = df.iloc[x[i-1]+1].values

Split dataframe dynamically

Answers (1)

Related Questions