Reputation: 95
I have the following data set:
d = {'person':[1,1,1,1,1,1],'id':['-8','-5','-4','-3','-3','-2'],'obs':
['A','B','C','D','E','F']}
df_start = pd.DataFrame(data=d)
Need to create an output dataset like:
d_end = {'id':[-8,-8,-5,-8,-5,-4,-5,-4,-3,-3,-5,-4,-3,-3],
'obs':['A','A','B','A','B','C','B','C','D','E','B','C','D','E'],
'id_group':[-8,-5,-5,-4,-4,-4,-3,-3,-3,-3,-2,-2,-2,-2]}
df_end = pd.DataFrame(data=d_end)
I am trying to group the rows using a new column called id_group that is created by comparing id values across rows. A single id will belong to its own id group. An id will belong to another id_group if (id+4) is greater than or equal to an id on another row.
Have not been able to get very far using a for loop trying to do this, very open to suggestions
Upvotes: 1
Views: 65
Reputation: 59519
To do it without looping, perform one large merge and then subset afterwards:
df_start['id'] = df_start['id'].astype(int)
df_end = df_start.assign(dummy=1).merge(df_start[['id']].assign(dummy=1),
on='dummy',
suffixes=['','_gp']).drop(columns='dummy')
df_end = df_end[df_end.id.between(df_end.id_gp - 4, df_end.id_gp)]
person id obs id_gp
0 1 -8 A -8
1 1 -8 A -5
2 1 -8 A -4
7 1 -5 B -5
8 1 -5 B -4
9 1 -5 B -3
10 1 -5 B -3
11 1 -5 B -2
14 1 -4 C -4
15 1 -4 C -3
16 1 -4 C -3
17 1 -4 C -2
21 1 -3 D -3
22 1 -3 D -3
23 1 -3 D -2
27 1 -3 E -3
28 1 -3 E -3
29 1 -3 E -2
35 1 -2 F -2
With looping, create subsets for all values within 4 of the id
. Concatenate them all together.
df_start['id'] = df_start['id'].astype(int)
l = []
for id_gp in df_start.id.unique():
l.append(df_start[df_start.id.between(id_gp-4, id_gp)].assign(id_gp = id_gp))
df_end = pd.concat(l, ignore_index=True)
Upvotes: 1