Jesus Rincon

Reputation: 136

Iterative comparison with pandas

I don't know how to approach this issue. I have a data frame that looks like this:

cuenta_bancaria  nombre_empresa  perfil_cobranza  usuario_id  usuario_web
5545             a               123              500199      5012
5551             a               123              500199      3321
5551             a               55               500199      5541
5551             b               55               500199      5246

What I need to do is iterate over the rows for each usuario_id, check whether anything differs from the previous row, and build a new dataset containing the name of the column that changed and the usuario_web responsible for that change, producing a data frame that looks like this:

usuario_id  cambio           usuario_web
500199      cuenta_bancaria  3321
500199      perfil_cobranza  5541
500199      nombre_empresa   5246

Is there any way to do this? I'm working with pandas in Python, and the dataset could be fairly large, around 10,000 rows, sorted by usuario_id.

Thanks for any advice.

Upvotes: 0

Views: 163

Answers (2)

Jacob H

Reputation: 607

There are a couple of ways to iterate over a dataframe. One is iterrows:

for index, row in df.iterrows():
    pass  # process each row here

But since you want to reference the prior row, I think the easiest approach is to iterate by position:

import pandas as pd

# Collect rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for i in range(1, len(df)):
    current = df.iloc[i]
    last = df.iloc[i - 1]
    # Record only the first column that differs from the previous row
    for col in ['cuenta_bancaria', 'nombre_empresa', 'perfil_cobranza']:
        if current[col] != last[col]:
            rows.append({'usuario_id': current['usuario_id'],
                         'cambio': col,
                         'usuario_web': current['usuario_web']})
            break
df2 = pd.DataFrame(rows, columns=['usuario_id', 'cambio', 'usuario_web'])
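
For reference, here is a self-contained sketch of the positional loop run against the sample data from the question. It is not a definitive implementation: the helper name find_changes and the TRACKED list are made up for illustration, and it makes two assumptions beyond the code above, namely that rows are only compared when they share a usuario_id, and that every changed column in a row is recorded rather than just the first.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'cuenta_bancaria': [5545, 5551, 5551, 5551],
    'nombre_empresa': ['a', 'a', 'a', 'b'],
    'perfil_cobranza': [123, 123, 55, 55],
    'usuario_id': [500199, 500199, 500199, 500199],
    'usuario_web': [5012, 3321, 5541, 5246],
})

# Columns to watch for changes (hypothetical name)
TRACKED = ['cuenta_bancaria', 'nombre_empresa', 'perfil_cobranza']


def find_changes(frame):
    """Hypothetical helper: return one output row per changed column."""
    rows = []
    for i in range(1, len(frame)):
        current = frame.iloc[i]
        last = frame.iloc[i - 1]
        # Assumption: a change only counts within the same usuario_id
        if current['usuario_id'] != last['usuario_id']:
            continue
        for col in TRACKED:
            if current[col] != last[col]:
                rows.append({
                    'usuario_id': current['usuario_id'],
                    'cambio': col,
                    'usuario_web': current['usuario_web'],
                })
    return pd.DataFrame(rows, columns=['usuario_id', 'cambio', 'usuario_web'])


changes = find_changes(df)
print(changes)
```

A Python-level loop like this should be fine for around 10,000 rows, although it will be noticeably slower than a vectorized comparison.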

Upvotes: 1

cs95

Reputation: 402363

Compare adjacent rows with ne + shift to obtain a boolean mask, then use the mask to

  • index into df to get the rows where something changed
  • index into df.columns to get the name of the column that changed

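# Columns to monitor for changes
c = df.columns.intersection(
        ['nombre_empresa', 'perfil_cobranza', 'cuenta_bancaria']
)

# True where a value differs from the row above (the first row is all True)
i = df[c].ne(df[c].shift())
# Keep rows where exactly one monitored column changed
j = i.sum(axis=1).eq(1)
df = df.loc[j, ['usuario_id', 'usuario_web']]
# argmax gives the position of the single True in each kept row
df.insert(1, 'cambio', c[i[j].values.argmax(1)])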

df

   usuario_id           cambio  usuario_web
1      500199  cuenta_bancaria         3321
2      500199  perfil_cobranza         5541
3      500199   nombre_empresa         5246
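
If the real data holds several usuario_id values, or a row can change more than one column at once, the same mask idea can be extended. The following is only a sketch under those assumptions, not part of the approach above: it shifts within each usuario_id using groupby, drops each user's first row with duplicated, and uses numpy.nonzero to emit one output row per changed cell. The names tracked, prev, not_first, mask and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'cuenta_bancaria': [5545, 5551, 5551, 5551],
    'nombre_empresa': ['a', 'a', 'a', 'b'],
    'perfil_cobranza': [123, 123, 55, 55],
    'usuario_id': [500199, 500199, 500199, 500199],
    'usuario_web': [5012, 3321, 5541, 5246],
})

tracked = ['cuenta_bancaria', 'nombre_empresa', 'perfil_cobranza']

# Previous row's values within each usuario_id (missing on a user's first row)
prev = df.groupby('usuario_id')[tracked].shift()

# A user's first row has no predecessor, so it must not count as a change;
# duplicated() is False for the first occurrence of each usuario_id
not_first = df['usuario_id'].duplicated().to_numpy()

# True where a tracked value differs from the same user's previous row
mask = df[tracked].ne(prev).to_numpy() & not_first[:, None]

# Row and column positions of every changed cell, in row-major order
row_pos, col_pos = np.nonzero(mask)

result = pd.DataFrame({
    'usuario_id': df['usuario_id'].to_numpy()[row_pos],
    'cambio': np.array(tracked)[col_pos],
    'usuario_web': df['usuario_web'].to_numpy()[row_pos],
})
print(result)
```

On the sample data this yields the same three rows as above. Unlike the sum(1).eq(1) filter, a row where two tracked columns change at once would produce two output rows, one per column.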

Upvotes: 1
