pawelty
pawelty

Reputation: 1000

Efficient way of conditional value setting

I have a pandas DataFrame of operations in my application:

+----------+-------------+
| UserName |  StartEdit  |
+----------+-------------+
| John     | 12-Jul-2015 |
| David    | 16-Aug-2015 |
| Katie    | 20-Aug-2015 |
| Cristin  | 2-Sep-2015  |
| Katie    | 12-Sep-2015 |
| John     | 23-Nov-2015 |
| David    | 2-Jan-2016  |
| David    | 3-Jan-2016  |
| John     | 10-Feb-2016 |
| Steven   | 13-Mar-2016 |
| Steven   | 14-Mar-2016 |
+----------+-------------+

I would like to create another column with UserTeam. I know that Katie, Cristin and Steven have always been on the same team:

owners_teams = {"Katie":"A", "Cristin":"B", "Steven":"C"}

So when I do df["UserTeam"] = df["UserName"].map(owners_teams) I get:

+----------+-------------+----------+
| UserName |  StartEdit  | UserTeam |
+----------+-------------+----------+
| John     | 12-Jul-2015 | NaN      |
| David    | 16-Aug-2015 | NaN      |
| Katie    | 20-Aug-2015 | A        |
| Cristin  | 2-Sep-2015  | B        |
| Katie    | 12-Sep-2015 | A        |
| John     | 23-Nov-2015 | NaN      |
| David    | 2-Jan-2016  | NaN      |
| David    | 3-Jan-2016  | NaN      |
| John     | 10-Feb-2016 | NaN      |
| Steven   | 13-Mar-2016 | C        |
| Steven   | 14-Mar-2016 | C        |
+----------+-------------+----------+

Now, I also know that:

John moved from A to C on 01-Jan-2016

David moved from B to C on 12-Dec-2015

changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]

I know how to do it with apply and looping over the rows and hardcoding all rules but I don't think it's efficient. How do I do it in a vectorized way for a big number of users?

Expected result:

+----------+-------------+----------+
| UserName |  StartEdit  | UserTeam |
+----------+-------------+----------+
| John     | 12-Jul-2015 | A        |
| David    | 16-Aug-2015 | B        |
| Katie    | 20-Aug-2015 | A        |
| Cristin  | 2-Sep-2015  | B        |
| Katie    | 12-Sep-2015 | A        |
| John     | 23-Nov-2015 | A        |
| David    | 2-Jan-2016  | C        |
| David    | 3-Jan-2016  | C        |
| John     | 10-Feb-2016 | C        |
| Steven   | 13-Mar-2016 | C        |
| Steven   | 14-Mar-2016 | C        |
+----------+-------------+----------+

Upvotes: 2

Views: 63

Answers (2)

piRSquared
piRSquared

Reputation: 294338

pd.merge_asof

This is the perfect use case for pd.merge_asof but requires that you keep track of the changes. Set up another dataframe teams that does that tracking.

Note

  • I used an initial date that was the minimum date found in df.
  • I set up John and David with initial teams you mentioned as well as initial dates.
  • I added another entry to show when John and David changed teams.

teams = pd.DataFrame([
    ['Katie', 'A', pd.Timestamp('2015-07-12')],
    ['Cristin', 'B', pd.Timestamp('2015-07-12')],
    ['Steven', 'C', pd.Timestamp('2015-07-12')],
    ['John', 'A', pd.Timestamp('2015-07-12')],
    ['David', 'B', pd.Timestamp('2015-07-12')],
    ['David', 'C', pd.Timestamp('2015-12-12')],
    ['John', 'C', pd.Timestamp('2016-01-01')],
], columns=['UserName', 'Team', 'StartEdit'])

teams

  UserName Team  StartEdit
0    Katie    A 2015-07-12
1  Cristin    B 2015-07-12
2   Steven    C 2015-07-12
3     John    A 2015-07-12
4    David    B 2015-07-12
5    David    C 2015-12-12
6     John    C 2016-01-01

Per the docs, make sure both dataframes are sorted by the relevant date columns.

pd.merge_asof(df, teams, on='StartEdit', by='UserName')

   UserName  StartEdit Team
0      John 2015-07-12    A
1     David 2015-08-16    B
2     Katie 2015-08-20    A
3   Cristin 2015-09-02    B
4     Katie 2015-09-12    A
5      John 2015-11-23    A
6     David 2016-01-02    C
7     David 2016-01-03    C
8      John 2016-02-10    C
9    Steven 2016-03-13    C
10   Steven 2016-03-14    C

Upvotes: 4

jpp
jpp

Reputation: 164693

One way is to use pd.DataFrame.loc within a for loop. In terms of efficiency, you should test and see whether this is performant for your use case.

changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]

for name, date, before, after in changes:
    name_mask = df['UserName'] == name
    df.loc[name_mask & (df['StartEdit'] < date), 'UserTeam'] = before
    df.loc[name_mask & (df['StartEdit'] >= date), 'UserTeam'] = after

You can also perform the equivalent mapping via numpy.where.

Upvotes: 3

Related Questions