ice queen

Reputation: 11

Extract subgroups from pandas dataframe

I have the following table in a Pandas dataframe:

Seconds Color Break NaN End
0.639588 123 4 NaN -
1.149597 123 1 NaN -
1.671333 123 2 NaN -
1.802052 123 2 NaN -
1.900091 123 1 NaN -
2.031240 123 4 NaN -
2.221477 123 3 NaN -
2.631840 123 2 NaN -
2.822245 123 1 NaN -
2.911147 123 4 NaN -
3.133344 123 1 NaN -
3.531246 123 1 NaN -
3.822389 123 1 NaN -
3.999389 123 2 NaN -
4.327990 123 4 NaN -

I'm trying to extract subgroups of the column labelled as 'Break' in such a way that the first and last item of each group is a '4'. So, the first group should be: [4,1,2,2,1,4]; the second group: [4,3,2,1,4]; the third group: [4,1,1,1,2,4]. The last '4' of each group is the first '4' of the following group.

I have the following code:


groups = []

def extract_phrases_between_four(data, new_group = []):
   
   row_iterator = data.iterrows()
   for i, row in row_iterator: #for index, row in row_iterator
       
       if row['Break_Level_Annotation'] != '4': 
           new_group.append(row['Break_Level_Annotation']) 
       
           
       if row['Break_Level_Annotation'] == '4': 
           new_group = []
           new_group.append(row['Break_Level_Annotation'])
           
       groups.append(new_group)
   return groups

but my output is:

[[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,3,2,1],[4,3,2,1],[4,3,2,1],[4,3,2,1],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2]].

It's returning the same new_group repeatedly as many times as there are items in each new_group, while at the same time not including the final '4' of each new_group.

I've tried to move around the code but I can't seem to understand what the problem is. How can I get each new_group to include its first and final '4' and for the new_group to be included only once in the array 'groups'?

Upvotes: 0

Views: 225

Answers (2)

ronpi

Reputation: 490

The problem is that in each step in the for loop, you are adding new_group to groups, although you are still adding elements to new_group. You need to execute groups.append(new_group) inside the second if statement.

Also, note that you can iterate directly over the values of the "Break" column instead of iterating over the whole dataframe and looking up the value in each row.

I rewrote the code a little bit, and it looks as follows:

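groups = []
new_group = []
# `data` is the dataframe from the question
for i in data["Break"]:
    new_group.append(i)
    if i == 4:
        # a lone [4] is only the start of the first group, so skip it
        if len(new_group) > 1:
            groups.append(new_group)
            # the closing 4 also opens the next group
            new_group = [4]

print(groups)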

This produces the result:

[[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
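As a side note on why the original output repeated each group: `groups.append(new_group)` stores a reference to the list, not a copy, so every later `new_group.append(...)` also shows up in the entries already stored. A minimal sketch of that behaviour, using made-up values rather than the dataframe from the question:

```python
groups = []
new_group = [4]

groups.append(new_group)  # stores a reference to the list object, not a copy
new_group.append(1)       # mutates the same object that is already in groups
groups.append(new_group)  # a second reference to the same list

print(groups)                  # [[4, 1], [4, 1]]
print(groups[0] is groups[1])  # True: both entries are the same object
```

Appending only once per finished group and then rebinding `new_group` to a fresh list, as in the loop above, avoids this.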

Upvotes: 0

Henry Yik

Reputation: 22493

IIUC, you can extract the index of the rows where "Break" equals 4 and use a list comprehension:

import numpy as np

s = df.loc[df["Break"].eq(4)].index  # index labels of the rows where Break == 4

print([df.loc[np.arange(x, y + 1), "Break"].tolist() for x, y in zip(s, s[1:])])

[[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
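The snippet above looks rows up by label, so it relies on the dataframe having the default 0, 1, 2, ... index. If that may not hold (for example after filtering), a position-based variant is one option. This is only a sketch: the dataframe is rebuilt here from the "Break" values in the question, and the name `pos` is just for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the "Break" column in the question.
df = pd.DataFrame({"Break": [4, 1, 2, 2, 1, 4, 3, 2, 1, 4, 1, 1, 1, 2, 4]})

# Integer positions (not index labels) of the rows where Break == 4.
pos = np.flatnonzero(df["Break"].to_numpy() == 4)

# Slice between consecutive 4s by position; the +1 keeps the closing 4.
groups = [df["Break"].iloc[a:b + 1].tolist() for a, b in zip(pos, pos[1:])]

print(groups)
# [[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
```

Because `iloc` slices by position, the result does not depend on what the index labels are.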

Upvotes: 3
