Kallol
Kallol

Reputation: 2189

Split the data frame based on consecutive row values differences

I have a data frame like this,

df
col1    col2    col3
 1        2      3
 2        5      6
 7        8      9
10       11     12
11       12     13
13       14     15
14       15     16

Now I want to create multiple data frames from above when the col1 difference of two consecutive rows are more than 1. So the result data frames will look like,

df1
col1    col2    col3
 1        2      3
 2        5      6
df2
col1    col2    col3
 7        8      9
df3
col1    col2    col3
10       11     12
11       12     13
df4
col1    col2    col3
13       14     15
14       15     16

I can do this using for loop and storing the indices but this will increase execution time, looking for some pandas shortcuts or pythonic way to do this most efficiently.

Upvotes: 2

Views: 1177

Answers (2)

Chris M.
Chris M.

Reputation: 26

You can also peel off the target column and work with it as a series, rather than the above answer. That keeps everything smaller. It runs faster on the example, but I don't know how they'll scale up, depending how many times you're splitting.

row_bool = df['col1'].diff()>1
split_inds, = np.where(row_bool)
split_inds = np.insert(arr=split_inds, obj=[0,len(split_inds)], values=[0,len(df)])

df_tup = ()
for n in range(0,len(split_inds)-1):
    tempdf = df.iloc[split_inds[n]:split_inds[n+1],:]
    df_tup.append(tempdf)

(Just throwing it in a tuple of dataframes afterward, but the dictionary approach might be better?)

Upvotes: 0

yatu
yatu

Reputation: 88236

You could define a custom grouper by taking the diff, checking when it is greater than 1, and take the cumsum of the boolean series. Then group by the result and build a dictionary from the groupby object:

d = dict(tuple(df.groupby(df.col1.diff().gt(1).cumsum())))

print(d[0])
   col1  col2  col3
0     1     2     3
1     2     5     6

print(d[1])
   col1  col2  col3
2     7     8     9

A more detailed break-down:

df.assign(difference=(diff:=df.col1.diff()), 
          condition=(gt1:=diff.gt(1)), 
          grouper=gt1.cumsum())

   col1  col2  col3  difference  condition  grouper
0     1     2     3         NaN      False        0
1     2     5     6         1.0      False        0
2     7     8     9         5.0       True        1
3    10    11    12         3.0       True        2
4    11    12    13         1.0      False        2
5    13    14    15         2.0       True        3
6    14    15    16         1.0      False        3

Upvotes: 7

Related Questions