Gonzalo Hernandez

Reputation: 737

Group latest values in pandas columns for a given id

I have a pandas dataframe containing some metrics for a given date and user.

>>> pd.DataFrame({"user": ['juan','juan','juan','gonzalo'], "date": [1, 2, 3, 1], "var1": [1, 2, None, 1], "var2": [None, 4, 5, 6]})
      user  date  var1  var2
0     juan     1   1.0   NaN
1     juan     2   2.0   4.0
2     juan     3   NaN   5.0
3  gonzalo     1   1.0   6.0

Now, for each user, I want to extract the 2 most recent values of each variable (var1, var2), ignoring NaN unless there aren't enough non-NaN values to fill both slots.

For reference, this is the expected resulting dataframe for the data depicted above:

user     var1_0  var1_1  var2_0  var2_1
juan       2.0     1.0     5.0    4.0
gonzalo    1.0      NaN    6.0    NaN

Each "historical" value is added as a new column with a _0 or _1 suffix.

Upvotes: 0

Views: 71

Answers (2)

Murilo Cunha

Reputation: 506

You could group by user and aggregate to get the 2 most recent values. That gets you almost all the way there - but instead of separate columns you have a list of elements. If you want the actual 2 columns, you have to split the newly created list into columns. Full code:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "user": ["juan", "juan", "juan", "gonzalo"],
        "date": [1, 2, 3, 1],
        "var1": [1, 2, None, 1],
        "var2": [None, 4, 5, 6],
    }
)

# This almost gets you there
df = (
    df.sort_values(by="date")
    .groupby("user")
    .agg({"var1": lambda x: x.dropna().head(2), "var2": lambda x: x.dropna().head(2)})
)

# Split the columns and get the correct column names
df[["var1_0", "var2_0"]] = df[["var1", "var2"]].apply(
    lambda row: pd.Series(el[0] if isinstance(el, np.ndarray) else el for el in row),
    axis=1,
)
df[["var1_1", "var2_1"]] = df[["var1", "var2"]].apply(
    lambda row: pd.Series(el[-1] if isinstance(el, np.ndarray) else None for el in row),
    axis=1,
)
print(df)

>>
               var1        var2  var1_0  var2_0  var1_1  var2_1
user                                                           
gonzalo         1.0         6.0     1.0     6.0     NaN     NaN
juan     [1.0, 2.0]  [4.0, 5.0]     1.0     4.0     2.0     5.0
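Note that this ordering is oldest-first (var1_0 is 1.0 for juan), while the question's expected output puts the most recent value in the _0 column. A minimal sketch of a tail-based variant that produces the newest-first layout (the helper name last_two is my own, not from the answer) could look like:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user": ["juan", "juan", "juan", "gonzalo"],
        "date": [1, 2, 3, 1],
        "var1": [1, 2, None, 1],
        "var2": [None, 4, 5, 6],
    }
)

def last_two(g):
    # For each variable, take the 2 most recent non-NaN values (newest first),
    # padding with None when a user has fewer than 2 values.
    out = {}
    for col in ("var1", "var2"):
        vals = g[col].dropna().tail(2).tolist()[::-1]
        vals += [None] * (2 - len(vals))
        out[f"{col}_0"], out[f"{col}_1"] = vals
    return pd.Series(out)

res = df.sort_values("date").groupby("user")[["var1", "var2"]].apply(last_two)
print(res)
```

Selecting only the value columns before apply keeps the grouping column out of each group, which also avoids the deprecation warning newer pandas versions raise about operating on grouping columns.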

Upvotes: 1

jezrael

Reputation: 863166

First, sort by both columns with DataFrame.sort_values if necessary, reshape with DataFrame.melt and remove missing values, keep the 2 most recent rows per group with GroupBy.tail, then create a counter column with GroupBy.cumcount and pivot with DataFrame.pivot, flattening the MultiIndex:

df1 = (df.sort_values(['user','date'])
         .melt(id_vars='user', value_vars=['var1','var2'])
         .dropna(subset=['value'])
        )


df1 = df1.groupby(['user','variable']).tail(2)
df1['g'] = df1.groupby(['user','variable']).cumcount(ascending=False)

df1 = df1.pivot(index='user', columns=['variable', 'g'], values='value')
# for older pandas versions:
#df1 = df1.set_index(['user','variable', 'g'])['value'].unstack([1,2])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index()
print(df1)
      user  var1_0  var1_1  var2_0  var2_1
0  gonzalo     1.0     NaN     6.0     NaN
1     juan     2.0     1.0     5.0     4.0

Upvotes: 2
