Michael Mathews Jr.

Reputation: 329

Groupby to Identify minimum dates for each group

I'm trying to identify the minimum "date" within each subgroup of my data, using a lambda function together with pandas groupby. Code:

df['min_asd'] = df.anticipatedstartdate.map(
lambda x: 1 if df.groupby('learners_id').anticipatedstartdate == x.min() else 0)

The goal is to get the minimum date BY the learners id (student id), not the global minimum date.

The error message I receive:

TypeError: 'Timestamp' object is not callable

Upvotes: 0

Views: 1723

Answers (1)

André C. Andersen

Reputation: 9375

Unfortunately, you can't use the map() function in conjunction with the groupby() function like that. The map() function takes a lambda which expects a single value (x) for each row of the series (anticipatedstartdate). That means x is just one Timestamp, and because of that it does not have a min() aggregation function like you'd expect of a (numpy) array. What it does have is a min attribute defined on the class, which the x object gladly provides. This is simply a constant Timestamp marking the earliest representable value, the counterpart of the max constant Timestamp('2262-04-11 23:47:16.854775807').

The reason you are getting the error:

TypeError: 'Timestamp' object is not callable

is that this min constant Timestamp isn't callable, because it is just an object which does not implement the __call__() method. When you try to use an object as a function, Python tries to invoke its __call__() method and fails with a TypeError if there is none.
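The error can be reproduced in isolation, without any groupby involved. This is a minimal sketch; the variable names x and message are only illustrative and do not come from the question.

```python
import pandas as pd

# A single Timestamp, like the value map() hands to the lambda for each row.
x = pd.Timestamp('2020-01-04')

# x.min is itself a Timestamp (a lower-bound constant), not a method.
print(type(x.min).__name__)
print(callable(x.min))

try:
    x.min()  # the same call the question's lambda makes
except TypeError as err:
    message = str(err)
    print(message)
```

Because x.min evaluates to a Timestamp rather than a function, adding parentheses after it is what triggers the TypeError.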

Anyway, in the future you should probably include some test data, but I think I managed to put together a suitable example.

The following should do it:

import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'learners_id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'anticipatedstartdate': [
            '2020-01-04',
            '2020-01-05',
            '2020-01-03',
            '2020-01-07',
            '2020-01-01',
            '2020-01-08',
            '2020-01-09',
            '2020-01-06',
            '2020-01-02',
        ]
    }
)
df['anticipatedstartdate'] = pd.to_datetime(df['anticipatedstartdate'])
df['min_asd'] = 1 * (
    df.groupby('learners_id')['anticipatedstartdate'].transform('min') == df['anticipatedstartdate']
)

This outputs:

   learners_id anticipatedstartdate  min_asd
0            1           2020-01-04        1
1            2           2020-01-05        0
2            3           2020-01-03        0
3            1           2020-01-07        0
4            2           2020-01-01        1
5            3           2020-01-08        0
6            1           2020-01-09        0
7            2           2020-01-06        0
8            3           2020-01-02        1

Normally you would use an aggregation function directly on the result of the groupby() function and the series selector (projection if you will), like this:

df.groupby('learners_id')['anticipatedstartdate'].min()

learners_id
1   2020-01-04
2   2020-01-01
3   2020-01-02

However, this collapses the data to one row per group, whereas what you want is one result per entry, so that you keep the granularity of your original dataset. This can be done using the .transform() function instead:

df.groupby('learners_id')['anticipatedstartdate'].transform('min')

0   2020-01-04
1   2020-01-01
2   2020-01-02
3   2020-01-04
4   2020-01-01
5   2020-01-02
6   2020-01-04
7   2020-01-01
8   2020-01-02
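One way to think about transform('min') is as "aggregate, then broadcast back onto the original rows". The sketch below does that broadcast by hand with Series.map and compares it to transform; the smaller DataFrame and the names group_min, by_hand and via_transform are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'learners_id': [1, 2, 3, 1, 2, 3],
    'anticipatedstartdate': pd.to_datetime([
        '2020-01-04', '2020-01-05', '2020-01-03',
        '2020-01-07', '2020-01-01', '2020-01-02',
    ]),
})

# One minimum per group, indexed by learners_id.
group_min = df.groupby('learners_id')['anticipatedstartdate'].min()

# Broadcast each group's minimum back onto the original rows by hand...
by_hand = df['learners_id'].map(group_min)

# ...which gives the same values transform('min') returns in one step.
via_transform = df.groupby('learners_id')['anticipatedstartdate'].transform('min')

print((by_hand == via_transform).all())
```

Both results have one entry per original row and share the original index, which is what makes the row-wise comparison in the next step possible.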

Now, because you have the minimum value of anticipatedstartdate for each of your original records, you can simply do an equality check against the original anticipatedstartdate series:

df.groupby('learners_id')['anticipatedstartdate'].transform('min') == df['anticipatedstartdate']

0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8     True

This is the result we are looking for, but it seemed like you wanted them as {0, 1} integers. It's easy to coerce bools to those by just multiplying by 1, giving you:

df['min_asd'] = 1 * (
    df.groupby('learners_id')['anticipatedstartdate'].transform('min') == df['anticipatedstartdate']
)
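If you prefer the conversion to be explicit, astype(int) should give the same {0, 1} values as multiplying by 1. The sketch below uses a small made-up DataFrame (not the one above) in which learner 2 has two rows sharing the same earliest date, to show that the equality check flags every tied row.

```python
import pandas as pd

df = pd.DataFrame({
    'learners_id': [1, 1, 2, 2],
    'anticipatedstartdate': pd.to_datetime(
        ['2020-01-04', '2020-01-07', '2020-01-01', '2020-01-01']
    ),
})

# Per-row minimum of the row's own group.
group_min = df.groupby('learners_id')['anticipatedstartdate'].transform('min')
is_min = df['anticipatedstartdate'] == group_min

# astype(int) makes the bool -> {0, 1} conversion explicit.
df['min_asd'] = is_min.astype(int)

print(df)
```

Here both rows of learner 2 get min_asd equal to 1, because both match the group minimum. If you need exactly one flagged row per learner, you would have to break ties separately.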

Upvotes: 1
