Reputation: 15
I have a large pandas dataframe of ~2million rows. Columns A and B are both ID columns, where ID==1 in Column A is the same ID entity as ID==1 in column B. Column C is the target value column.
Is there an efficient way, using pandas' native functions such as groupby etc or vectorisation methods, and avoiding looping, to:
For each row, lookup the previous occurrence in column A of row B's ID value, and get the previous occurrence row's column C value?
I've tried various groupby/rolling methods to no avail. Many thanks in advance.
Upvotes: 0
Views: 54
Reputation: 2787
You can use a combination of pd.get_dummies
and pd.mul
.
A simple example:
A B C
0 2 4 48
1 1 3 14
2 4 4 19
3 5 5 20
4 4 3 20
5 1 3 41
6 3 2 45
7 2 2 34
8 1 2 44
9 1 4 43
Here, indices 6-9 contain IDs in column B
that occur in previous rows of column A
. 6 and 7 match with index 0, 8 matches with index 7 and 9 with index 4.
# read data above
df = pd.read_clipboard()
# append "Out" column to df with the desired output
df["Out"] = pd.get_dummies(df.A) \
.mul(df.C, axis=0).replace(0, method="ffill").shift() \
.mul(pd.get_dummies(df.B)).max(axis=1).fillna(0)
df
#Out
# A B C Out
# 0 2 4 48 0.0
# 1 1 3 14 0.0
# 2 4 4 19 0.0
# 3 5 5 20 0.0
# 4 4 3 20 0.0
# 5 1 3 41 0.0
# 6 3 2 45 48.0
# 7 2 2 34 48.0
# 8 1 2 44 34.0
# 9 1 4 43 20.0
Breaking this down, the first step creates a pd.DataFrame of 1s and 0s for the IDs of column A
. Then this is multiplied by the values in column C
:
pd.get_dummies(df.A).mul(df.C, axis=0)
# 1 2 3 4 5
# 0 0 48 0 0 0
# 1 14 0 0 0 0
# 2 0 0 0 19 0
# 3 0 0 0 0 20
# 4 0 0 0 20 0
# 5 41 0 0 0 0
# 6 0 0 45 0 0
# 7 0 34 0 0 0
# 8 44 0 0 0 0
# 9 43 0 0 0 0
Zeros are forward-filled and the rows are shifted once (else index 7 in this example will return its own value in C
). The values are multiplied by the 1s and 0s of B:
pd.get_dummies(df.A) \
.mul(df.C, axis=0).replace(0, method="ffill").shift() \
.mul(pd.get_dummies(df.B))
# 1 2 3 4 5
# 0 NaN NaN NaN NaN NaN
# 1 NaN 0.0 0.0 0.0 0.0
# 2 NaN 0.0 0.0 0.0 0.0
# 3 NaN 0.0 0.0 0.0 0.0
# 4 NaN 0.0 0.0 0.0 0.0
# 5 NaN 0.0 0.0 0.0 0.0
# 6 NaN 48.0 0.0 0.0 0.0
# 7 NaN 48.0 0.0 0.0 0.0
# 8 NaN 34.0 0.0 0.0 0.0
# 9 NaN 0.0 0.0 20.0 0.0
And finally the maximum value in each row will always be the matching (as all others will be 0
or NaN
) - filling resulting NaN
values with 0 for consistency.
Upvotes: 0