GLaDER
GLaDER

Reputation: 345

Extract chunks of pandas dataframe when "value in column X" changes

I have a dataframe df with ~30 columns and 100 000 rows. I want to split the dataframe (and ultimately extract the data to np.arrays) depending on the value of a particular cell.

Example

Given the following dataframe

   C1 C2 C3 C4
 0  0  0  0  0
 1  0  0  1  0
 2  0  0  1  0
 3  0  0  2  0
 4  0  0  2  0
 5  0  0  2  0
 6  0  0  0  0
 7  0  0  1  0
 8  0  0  3  0
 9  0  0  5  0
10  0  0  2  0
11  0  0  2  0
12  0  0  2  0

I want to split it when column C3 changes from 2. I.e. after row 5, such that rows 0 - 5 becomes one dataframe (or np.array) and 6-12 becomes another one. The result should be:

   C1 C2 C3 C4
 0  0  0  0  0
 1  0  0  1  0
 2  0  0  1  0
 3  0  0  2  0
 4  0  0  2  0
 5  0  0  2  0

and

   C1 C2 C3 C4
 0  0  0  0  0
 1  0  0  1  0
 2  0  0  3  0
 3  0  0  5  0
 4  0  0  2  0
 5  0  0  2  0
 6  0  0  2  0

The columns in my actual dataframe have different values (numerical or strings).

My current approach is to iterate through the entire frame and create an "index array" where I figure out how to split the dataframe. Firstly, it does not do exactly what I want it to do and there must be a more efficient way to do this...

def filter_data(df):

    indices = []        # list to hold indices where we are to split the frame
    counter = -1
    waiting_for_new_index = True

    for idx, v in df["column_of_interest"].iteritems(): # row index and corresponding value

        if v != 2:
            waiting_for_new_index = True
            continue
        else:
            if waiting_for_new_index:
                waiting_for_new_index = False
                counter += 1

        # Add the counter to the index
        indices.append(counter)

    return indices

Upvotes: 3

Views: 1122

Answers (1)

jezrael
jezrael

Reputation: 862671

You can use:

a = (df.C3.eq(2).diff() & df.C3.shift().eq(2)).cumsum()
print (a)
0     0
1     0
2     0
3     0
4     0
5     0
6     1
7     1
8     1
9     1
10    1
11    1
12    1
Name: C3, dtype: int32

dfs = dict(tuple(df.groupby(a)))
print (dfs)
{0:    C1  C2  C3  C4
0   0   0   0   0
1   0   0   1   0
2   0   0   1   0
3   0   0   2   0
4   0   0   2   0
5   0   0   2   0, 1:     C1  C2  C3  C4
6    0   0   0   0
7    0   0   1   0
8    0   0   3   0
9    0   0   5   0
10   0   0   2   0
11   0   0   2   0
12   0   0   2   0}

Detail:

df['a'] =  df.C3.eq(2).diff()
df['b'] = df.C3.shift().eq(2)
df['c'] = df.a & df.b
df['d'] = df['c'].cumsum()
print (df)
    C1  C2  C3  C4      a      b      c  d
0    0   0   0   0    NaN  False  False  0
1    0   0   1   0  False  False  False  0
2    0   0   1   0  False  False  False  0
3    0   0   2   0   True  False  False  0
4    0   0   2   0  False   True  False  0
5    0   0   2   0  False   True  False  0
6    0   0   0   0   True   True   True  1
7    0   0   1   0  False  False  False  1
8    0   0   3   0  False  False  False  1
9    0   0   5   0  False  False  False  1
10   0   0   2   0   True  False  False  1
11   0   0   2   0  False   True  False  1
12   0   0   2   0  False   True  False  1

Upvotes: 2

Related Questions