Reputation: 345
I have a dataframe df
with ~30 columns and 100 000 rows. I want to split the dataframe (and ultimately extract the data to np.arrays) depending on the value of a particular cell.
Example
Given the following dataframe
C1 C2 C3 C4
0 0 0 0 0
1 0 0 1 0
2 0 0 1 0
3 0 0 2 0
4 0 0 2 0
5 0 0 2 0
6 0 0 0 0
7 0 0 1 0
8 0 0 3 0
9 0 0 5 0
10 0 0 2 0
11 0 0 2 0
12 0 0 2 0
I want to split it when column C3 changes from 2. I.e. after row 5, such that rows 0 - 5 becomes one dataframe (or np.array) and 6-12 becomes another one. The result should be:
C1 C2 C3 C4
0 0 0 0 0
1 0 0 1 0
2 0 0 1 0
3 0 0 2 0
4 0 0 2 0
5 0 0 2 0
and
C1 C2 C3 C4
0 0 0 0 0
1 0 0 1 0
2 0 0 3 0
3 0 0 5 0
4 0 0 2 0
5 0 0 2 0
6 0 0 2 0
The columns in my actual dataframe have different values (numerical or strings).
My current approach is to iterate through the entire frame and create an "index array" where I figure out how to split the dataframe. Firstly, it does not do exactly what I want it to do and there must be a more efficient way to do this...
def filter_data(df):
indices = [] # list to hold indices where we are to split the frame
counter = -1
waiting_for_new_index = True
for idx, v in df["column_of_interest"].iteritems(): # row index and corresponding value
if v != 2:
waiting_for_new_index = True
continue
else:
if waiting_for_new_index:
waiting_for_new_index = False
counter += 1
# Add the counter to the index
indices.append(counter)
return indices
Upvotes: 3
Views: 1122
Reputation: 862671
You can use:
a = (df.C3.eq(2).diff() & df.C3.shift().eq(2)).cumsum()
print (a)
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 1
12 1
Name: C3, dtype: int32
dfs = dict(tuple(df.groupby(a)))
print (dfs)
{0: C1 C2 C3 C4
0 0 0 0 0
1 0 0 1 0
2 0 0 1 0
3 0 0 2 0
4 0 0 2 0
5 0 0 2 0, 1: C1 C2 C3 C4
6 0 0 0 0
7 0 0 1 0
8 0 0 3 0
9 0 0 5 0
10 0 0 2 0
11 0 0 2 0
12 0 0 2 0}
Detail:
df['a'] = df.C3.eq(2).diff()
df['b'] = df.C3.shift().eq(2)
df['c'] = df.a & df.b
df['d'] = df['c'].cumsum()
print (df)
C1 C2 C3 C4 a b c d
0 0 0 0 0 NaN False False 0
1 0 0 1 0 False False False 0
2 0 0 1 0 False False False 0
3 0 0 2 0 True False False 0
4 0 0 2 0 False True False 0
5 0 0 2 0 False True False 0
6 0 0 0 0 True True True 1
7 0 0 1 0 False False False 1
8 0 0 3 0 False False False 1
9 0 0 5 0 False False False 1
10 0 0 2 0 True False False 1
11 0 0 2 0 False True False 1
12 0 0 2 0 False True False 1
Upvotes: 2