Mark
Mark

Reputation: 15

How to lookup the previous occurrence of a row's ID in another column, and get that previous occurrence row's value?

I have a large pandas dataframe of ~2million rows. Columns A and B are both ID columns, where ID==1 in Column A is the same ID entity as ID==1 in column B. Column C is the target value column.

Is there an efficient way, using pandas' native functions such as groupby etc or vectorisation methods, and avoiding looping, to:

For each row, lookup the previous occurrence in column A of row B's ID value, and get the previous occurrence row's column C value?

I've tried various groupby/rolling methods to no avail. Many thanks in advance.

Upvotes: 0

Views: 54

Answers (1)

Rawson
Rawson

Reputation: 2787

You can use a combination of pd.get_dummies and pd.mul.

A simple example:

    A   B   C
0   2   4   48
1   1   3   14
2   4   4   19
3   5   5   20
4   4   3   20
5   1   3   41
6   3   2   45
7   2   2   34
8   1   2   44
9   1   4   43

Here, indices 6-9 contain IDs in column B that occur in previous rows of column A. 6 and 7 match with index 0, 8 matches with index 7 and 9 with index 4.

# read data above
df = pd.read_clipboard()

# append "Out" column to df with the desired output
df["Out"] = pd.get_dummies(df.A) \
    .mul(df.C, axis=0).replace(0, method="ffill").shift() \
        .mul(pd.get_dummies(df.B)).max(axis=1).fillna(0)

df
#Out
#    A  B   C   Out
# 0  2  4  48   0.0
# 1  1  3  14   0.0
# 2  4  4  19   0.0
# 3  5  5  20   0.0
# 4  4  3  20   0.0
# 5  1  3  41   0.0
# 6  3  2  45  48.0
# 7  2  2  34  48.0
# 8  1  2  44  34.0
# 9  1  4  43  20.0

Breaking this down, the first step creates a pd.DataFrame of 1s and 0s for the IDs of column A. Then this is multiplied by the values in column C:

pd.get_dummies(df.A).mul(df.C, axis=0)
#     1   2   3   4   5
# 0   0  48   0   0   0
# 1  14   0   0   0   0
# 2   0   0   0  19   0
# 3   0   0   0   0  20
# 4   0   0   0  20   0
# 5  41   0   0   0   0
# 6   0   0  45   0   0
# 7   0  34   0   0   0
# 8  44   0   0   0   0
# 9  43   0   0   0   0

Zeros are forward-filled and the rows are shifted once (else index 7 in this example will return its own value in C). The values are multiplied by the 1s and 0s of B:

pd.get_dummies(df.A) \
    .mul(df.C, axis=0).replace(0, method="ffill").shift() \
        .mul(pd.get_dummies(df.B))
#     1     2    3     4    5
# 0 NaN   NaN  NaN   NaN  NaN
# 1 NaN   0.0  0.0   0.0  0.0
# 2 NaN   0.0  0.0   0.0  0.0
# 3 NaN   0.0  0.0   0.0  0.0
# 4 NaN   0.0  0.0   0.0  0.0
# 5 NaN   0.0  0.0   0.0  0.0
# 6 NaN  48.0  0.0   0.0  0.0
# 7 NaN  48.0  0.0   0.0  0.0
# 8 NaN  34.0  0.0   0.0  0.0
# 9 NaN   0.0  0.0  20.0  0.0

And finally the maximum value in each row will always be the matching (as all others will be 0 or NaN) - filling resulting NaN values with 0 for consistency.

Upvotes: 0

Related Questions