reality stilts
reality stilts

Reputation: 95

Create new column using for loop Pandas

I have the following data set:

d = {'person':[1,1,1,1,1,1],'id':['-8','-5','-4','-3','-3','-2'],'obs': 
['A','B','C','D','E','F']}
df_start = pd.DataFrame(data=d)

Need to create an output dataset like:

d_end = {'id':[-8,-8,-5,-8,-5,-4,-5,-4,-3,-3,-5,-4,-3,-3],
'obs':['A','A','B','A','B','C','B','C','D','E','B','C','D','E'],
'id_group':[-8,-5,-5,-4,-4,-4,-3,-3,-3,-3,-2,-2,-2,-2]}
df_end = pd.DataFrame(data=d_end)

I am trying to group the rows using a new column called id_group that is created by comparing id values across rows. A single id will belong to its own id group. An id will belong to another id_group if (id+4) is greater than or equal to an id on another row.

Have not been able to get very far using a for loop trying to do this, very open to suggestions

Upvotes: 1

Views: 65

Answers (1)

ALollz
ALollz

Reputation: 59519

To do it without looping, perform one large merge and then subset afterwards:

df_start['id'] = df_start['id'].astype(int)
df_end = df_start.assign(dummy=1).merge(df_start[['id']].assign(dummy=1), 
                                        on='dummy',
                                        suffixes=['','_gp']).drop(columns='dummy')

df_end = df_end[df_end.id.between(df_end.id_gp - 4, df_end.id_gp)]

Output:

    person  id obs  id_gp
0        1  -8   A     -8
1        1  -8   A     -5
2        1  -8   A     -4
7        1  -5   B     -5
8        1  -5   B     -4
9        1  -5   B     -3
10       1  -5   B     -3
11       1  -5   B     -2
14       1  -4   C     -4
15       1  -4   C     -3
16       1  -4   C     -3
17       1  -4   C     -2
21       1  -3   D     -3
22       1  -3   D     -3
23       1  -3   D     -2
27       1  -3   E     -3
28       1  -3   E     -3
29       1  -3   E     -2
35       1  -2   F     -2

With looping, create subsets for all values within 4 of the id. Concatenate them all together.

df_start['id'] = df_start['id'].astype(int)

l = []
for id_gp in df_start.id.unique():
    l.append(df_start[df_start.id.between(id_gp-4, id_gp)].assign(id_gp = id_gp))

df_end = pd.concat(l, ignore_index=True)

Upvotes: 1

Related Questions