Reputation: 1000
I have a pandas DataFrame of operations in my application:
+----------+-------------+
| UserName | StartEdit |
+----------+-------------+
| John | 12-Jul-2015 |
| David | 16-Aug-2015 |
| Katie | 20-Aug-2015 |
| Cristin | 2-Sep-2015 |
| Katie | 12-Sep-2015 |
| John | 23-Nov-2015 |
| David | 2-Jan-2016 |
| David | 3-Jan-2016 |
| John | 10-Feb-2016 |
| Steven | 13-Mar-2016 |
| Steven | 14-Mar-2016 |
+----------+-------------+
I would like to create another column with UserTeam. I know that Katie, Cristin and Steven have always been on the same team:
owners_teams = {"Katie":"A", "Cristin":"B", "Steven":"C"}
So when I do df["UserTeam"] = df["UserName"].map(owners_teams)
I get:
+----------+-------------+----------+
| UserName | StartEdit | UserTeam |
+----------+-------------+----------+
| John | 12-Jul-2015 | NaN |
| David | 16-Aug-2015 | NaN |
| Katie | 20-Aug-2015 | A |
| Cristin | 2-Sep-2015 | B |
| Katie | 12-Sep-2015 | A |
| John | 23-Nov-2015 | NaN |
| David | 2-Jan-2016 | NaN |
| David | 3-Jan-2016 | NaN |
| John | 10-Feb-2016 | NaN |
| Steven | 13-Mar-2016 | C |
| Steven | 14-Mar-2016 | C |
+----------+-------------+----------+
Now, I also know that:
John moved from A
to C
on 01-Jan-2016
David moved from B
to C
on 12-Dec-2015
changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]
I know how to do it with apply
and looping over the rows and hardcoding all rules but I don't think it's efficient. How do I do it in a vectorized way for a big number of users?
Expected result:
+----------+-------------+----------+
| UserName | StartEdit | UserTeam |
+----------+-------------+----------+
| John | 12-Jul-2015 | A |
| David | 16-Aug-2015 | B |
| Katie | 20-Aug-2015 | A |
| Cristin | 2-Sep-2015 | B |
| Katie | 12-Sep-2015 | A |
| John | 23-Nov-2015 | A |
| David | 2-Jan-2016 | C |
| David | 3-Jan-2016 | C |
| John | 10-Feb-2016 | C |
| Steven | 13-Mar-2016 | C |
| Steven | 14-Mar-2016 | C |
+----------+-------------+----------+
Upvotes: 2
Views: 63
Reputation: 294338
pd.merge_asof
This is the perfect use case for pd.merge_asof
but requires that you keep track of the changes. Set up another dataframe teams
that does that tracking.
Note
df
.teams = pd.DataFrame([
['Katie', 'A', pd.Timestamp('2015-07-12')],
['Cristin', 'B', pd.Timestamp('2015-07-12')],
['Steven', 'C', pd.Timestamp('2015-07-12')],
['John', 'A', pd.Timestamp('2015-07-12')],
['David', 'B', pd.Timestamp('2015-07-12')],
['David', 'C', pd.Timestamp('2015-12-12')],
['John', 'C', pd.Timestamp('2016-01-01')],
], columns=['UserName', 'Team', 'StartEdit'])
teams
UserName Team StartEdit
0 Katie A 2015-07-12
1 Cristin B 2015-07-12
2 Steven C 2015-07-12
3 John A 2015-07-12
4 David B 2015-07-12
5 David C 2015-12-12
6 John C 2016-01-01
Per the docs, make sure both dataframes are sorted by the relevant date columns.
pd.merge_asof(df, teams, on='StartEdit', by='UserName')
UserName StartEdit Team
0 John 2015-07-12 A
1 David 2015-08-16 B
2 Katie 2015-08-20 A
3 Cristin 2015-09-02 B
4 Katie 2015-09-12 A
5 John 2015-11-23 A
6 David 2016-01-02 C
7 David 2016-01-03 C
8 John 2016-02-10 C
9 Steven 2016-03-13 C
10 Steven 2016-03-14 C
Upvotes: 4
Reputation: 164693
One way is to use pd.DataFrame.loc
within a for
loop. In terms of efficiency, you should test and see whether this is performant for your use case.
changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]
for name, date, before, after in changes:
name_mask = df['UserName'] == name
df.loc[name_mask & (df['StartEdit'] < date), 'UserTeam'] = before
df.loc[name_mask & (df['StartEdit'] >= date), 'UserTeam'] = after
You can also perform the equivalent mapping via numpy.where
.
Upvotes: 3