Reputation: 11
I have the following table in a Pandas dataframe:
Seconds | Color | Break | NaN | End |
---|---|---|---|---|
0.639588 | 123 | 4 | NaN | - |
1.149597 | 123 | 1 | NaN | - |
1.671333 | 123 | 2 | NaN | - |
1.802052 | 123 | 2 | NaN | - |
1.900091 | 123 | 1 | NaN | - |
2.031240 | 123 | 4 | NaN | - |
2.221477 | 123 | 3 | NaN | - |
2.631840 | 123 | 2 | NaN | - |
2.822245 | 123 | 1 | NaN | - |
2.911147 | 123 | 4 | NaN | - |
3.133344 | 123 | 1 | NaN | - |
3.531246 | 123 | 1 | NaN | - |
3.822389 | 123 | 1 | NaN | - |
3.999389 | 123 | 2 | NaN | - |
4.327990 | 123 | 4 | NaN | - |
I'm trying to extract subgroups of the column labelled as 'Break' in such a way that the first and last item of each group is a '4'. So, the first group should be: [4,1,2,2,1,4]; the second group: [4,3,2,1,4]; the third group: [4,1,1,1,2,4]. The last '4' of each group is the first '4' of the following group.
I have the following code:
groups = []
def extract_phrases_between_four(data, new_group = []):
    row_iterator = data.iterrows()
    for i, row in row_iterator: #for index, row in row_iterator
        if row['Break_Level_Annotation'] != '4':
            new_group.append(row['Break_Level_Annotation'])
        if row['Break_Level_Annotation'] == '4':
            new_group = []
            new_group.append(row['Break_Level_Annotation'])
        groups.append(new_group)
    return groups
but my output is:
[[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,3,2,1],[4,3,2,1],[4,3,2,1],[4,3,2,1],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2]].
It's returning the same new_group repeatedly as many times as there are items in each new_group, while at the same time not including the final '4' of each new_group.
I've tried to move around the code but I can't seem to understand what the problem is. How can I get each new_group to include its first and final '4' and for the new_group to be included only once in the array 'groups'?
Upvotes: 0
Views: 225
Reputation: 490
The problem is that on every iteration of the for loop you append new_group to groups, even though you are still adding elements to new_group. Because groups then holds several references to the same list, each group shows up once per item it contains. You need to execute groups.append(new_group) inside the second if statement, at the point where a group is complete.
Also, note that you can iterate directly over the values of the "Break" column instead of iterating over the whole dataframe and looking up the value in each row.
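To make the repeated-reference effect concrete, here is a minimal sketch that uses a plain Python list rather than the dataframe. The three input values are made up for illustration; only the append pattern mirrors the code in the question.

```python
# Appending the same list object on every iteration stores repeated
# references to it, not snapshots of its contents at that moment.
groups = []
new_group = []
for value in [4, 1, 2]:  # illustrative values, not the real column
    new_group.append(value)
    groups.append(new_group)  # the same object is appended each time

print(groups)

# Every entry in groups is the very same list object.
all_same = all(g is new_group for g in groups)
print(all_same)
```

Each entry prints as the full list because all three entries are one and the same object, which is why the question's output repeats each group once per item.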
I rewrote the code a little bit, and it looks as follows:
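groups = []
new_group = []
for i in data["Break"]:
    new_group.append(i)
    if i == 4:
        # A 4 closes the current group (if it has more than the opening 4)
        if len(new_group) > 1:
            groups.append(new_group)
        # ...and the same 4 opens the next group
        new_group = [4]
print(groups)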
And here is the result:
[[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
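The loop above could also be wrapped in a reusable function, sketched below. The name split_on_marker and its marker parameter are hypothetical, not part of the original code. This variant also assumes that any values before the first 4 should be dropped, a case the table in the question does not cover. The input is a plain list here so the sketch runs without pandas; data["Break"].tolist() could be passed in the same way.

```python
def split_on_marker(values, marker=4):
    """Return runs that start and end with `marker`.

    Consecutive runs share the marker: the closing marker of one run
    is the opening marker of the next.
    """
    groups = []
    current = None  # stays None until the first marker is seen
    for value in values:
        if current is not None:
            current.append(value)
        if value == marker:
            if current is not None:
                groups.append(current)  # the run is complete
            current = [marker]  # the same marker opens the next run
    return groups


# The "Break" column from the question, written out as a list.
break_values = [4, 1, 2, 2, 1, 4, 3, 2, 1, 4, 1, 1, 1, 2, 4]
result = split_on_marker(break_values)
print(result)  # [[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
```

Values after the last marker never close a run, so they are left out of the result.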
Upvotes: 0
Reputation: 22493
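If I understand correctly, you can take the index labels of the rows where "Break" equals 4 and build each group with a list comprehension (this relies on the default integer index):
import numpy as np
s = df.loc[df["Break"].eq(4)].index
print([df.loc[np.arange(x, y+1), "Break"].tolist() for x, y in zip(s, s[1:])])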
[[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
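If the dataframe does not have the default 0..n-1 index, the label-based lookup above may not apply directly. A position-based variant of the same idea is sketched below. The letter index is a made-up example to show that labels no longer matter, and the dataframe is rebuilt here only so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

# The "Break" column from the question, with hypothetical non-numeric labels.
df = pd.DataFrame(
    {"Break": [4, 1, 2, 2, 1, 4, 3, 2, 1, 4, 1, 1, 1, 2, 4]},
    index=list("abcdefghijklmno"),
)

# Integer positions (not labels) of the rows where Break equals 4.
positions = np.flatnonzero(df["Break"].to_numpy() == 4)

# Pair each position with the next one and slice inclusively by position.
groups = [
    df["Break"].iloc[start:stop + 1].tolist()
    for start, stop in zip(positions, positions[1:])
]
print(groups)  # [[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
```

Pairing positions with zip(positions, positions[1:]) is what makes the closing 4 of one group the opening 4 of the next.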
Upvotes: 3