Reputation: 43
I do have a pandas DF (df_main) which I try to split into different subsets. The dataset look something like this:
a b c d e f
1 1 1 2 1 2 1.
2 3 2 1 2 1 2.
3 1 3 1 3 1 3.
3 2 1 3 4 1 4.
3 1 3 4 2 1 5.
2 1 2 3 4 2 6.
1 2 3 4 5 3 7.
I want to split the complete df
based on the element of column a and it's following element into 3 subsets.
Subset 1: increasing values of col(a)
, so 1., 2., 3.
Subset 2: value of col(a)
stays constant so 3., 4., 5.
Subset 3: decreasing value of col (a)
so 5., 6., 7.
My code looks at the moment like this:
df1_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
df2_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
df3_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
for j in range(len(df_main['a'])):
if df_main['a'][j] == df_main['a'][j + 1]:
df1_new = df1_new.append(df_main.iloc[j])
if df_main['a'][j] > df_main['a'][j + 1]:
df2_new = df2_new.append(df_main.iloc[j])
if df_main['a'][j] < df_main['a'][j + 1]:
df3_new = df3_new.append(df_main.iloc[j])
Due to the fact, that the df_main has a length of 1 353 419 rows, it needs (atm) around 15hours to complete a run.
Are there any options to optimise the time it needs to run through the df and splits its?
I have red a bit about numpy vectorization, but I am not sure, if this would be a proper workaround here.
The pattern, based on incremetenting, decremeting and constant values could be seen here
Upvotes: 4
Views: 108
Reputation: 71689
Use Series.gt
, Series.lt
and Series.eq
along with Series.shift
to create boolean masks m1
, m2
and m3
, then use these masks to filter/split the dataframe in the corresponding categories increasing
, decreasing
and constant
:
s1, s2 = df['a'].shift(), df['a'].shift(-1)
m1 = df['a'].gt(s1) | df['a'].lt(s2)
m2 = df['a'].lt(s1) | df['a'].gt(s2)
m3 = df['a'].eq(s1) | df['a'].eq(s2)
incr, decr, const = df[m1], df[m2], df[m3]
Result:
print(incr)
a b c d e f g
0 1 1 1 2 1 2 1
1 2 3 2 1 2 1 2
2 3 1 3 1 3 1 2
print(decr)
a b c d e f g
4 3 1 3 4 2 1 4
5 2 1 2 3 4 2 1
6 1 2 3 4 5 3 1
print(const)
a b c d e f g
2 3 1 3 1 3 1 2
3 3 2 1 3 4 1 3
4 3 1 3 4 2 1 4
Upvotes: 2