Optimization RunTime of DataFrame Split in Sub DataFrames in Python

Question

I do have a pandas DF (df_main) which I try to split into different subsets. The dataset look something like this:

a b c d e f

1 1 1 2 1 2   1.

2 3 2 1 2 1   2.

3 1 3 1 3 1   3.

3 2 1 3 4 1   4.

3 1 3 4 2 1   5.

2 1 2 3 4 2   6.

1 2 3 4 5 3   7.

I want to split the complete df based on the element of column a and it's following element into 3 subsets.

Subset 1: increasing values of col(a), so 1., 2., 3.

Subset 2: value of col(a) stays constant so 3., 4., 5.

Subset 3: decreasing value of col (a) so 5., 6., 7.

My code looks at the moment like this:

df1_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
df2_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
df3_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])

for j in range(len(df_main['a'])):
    if df_main['a'][j] == df_main['a'][j + 1]:
        df1_new = df1_new.append(df_main.iloc[j])
    if df_main['a'][j] > df_main['a'][j + 1]:
        df2_new = df2_new.append(df_main.iloc[j])
    if df_main['a'][j] < df_main['a'][j + 1]:
        df3_new = df3_new.append(df_main.iloc[j])

Due to the fact, that the df_main has a length of 1 353 419 rows, it needs (atm) around 15hours to complete a run.

Are there any options to optimise the time it needs to run through the df and splits its?

I have red a bit about numpy vectorization, but I am not sure, if this would be a proper workaround here.

The pattern, based on incremetenting, decremeting and constant values could be seen here

Shubham Sharma · Accepted Answer

Use Series.gt, Series.lt and Series.eq along with Series.shift to create boolean masks m1, m2 and m3, then use these masks to filter/split the dataframe in the corresponding categories increasing, decreasing and constant:

s1, s2 = df['a'].shift(), df['a'].shift(-1)

m1 = df['a'].gt(s1) | df['a'].lt(s2)
m2 = df['a'].lt(s1) | df['a'].gt(s2)
m3 = df['a'].eq(s1) | df['a'].eq(s2)

incr, decr, const = df[m1], df[m2], df[m3]

Result:

print(incr)
   a  b  c  d  e  f  g
0  1  1  1  2  1  2  1
1  2  3  2  1  2  1  2
2  3  1  3  1  3  1  2

print(decr)
   a  b  c  d  e  f  g
4  3  1  3  4  2  1  4
5  2  1  2  3  4  2  1
6  1  2  3  4  5  3  1

print(const)
   a  b  c  d  e  f  g
2  3  1  3  1  3  1  2
3  3  2  1  3  4  1  3
4  3  1  3  4  2  1  4

Optimization RunTime of DataFrame Split in Sub DataFrames in Python

Answers (1)

Related Questions