Reputation: 329
I'm trying to write a lambda function in pandas that allows me to identify the minimum "date" of a subgroup of data. I'm trying to use a lambda function and pandas groupby to solve this. Code:
df['min_asd'] = df.anticipatedstartdate.map(
lambda x: 1 if df.groupby('learners_id').anticipatedstartdate == x.min() else 0)
The goal is to get the minimum date BY the learners id (student id), not the global minimum date.
The error message I receive:
TypeError: 'Timestamp' object is not callable
Upvotes: 0
Views: 1723
Reputation: 9375
Unfortunately, you can't use the map() function in conjunction with the groupby() function like that. The map() function takes a lambda which receives a single value (x) for each row of the series (anticipatedstartdate). That means x is just one Timestamp, and because of that it does not have a min() aggregation function like you'd expect of a (numpy) array. What it does have is a min property on the class level, which the x object gladly provides. This is simply a constant holding the earliest representable Timestamp, the counterpart of Timestamp.max, which is the constant Timestamp('2262-04-11 23:47:16.854775807').
The reason you are getting the error:
TypeError: 'Timestamp' object is not callable
is that this constant Timestamp isn't callable: it is just an object that does not implement the __call__() function. When you try to use an object as a function, Python looks for its __call__() function and, not finding one, fails with a TypeError.
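To see this for yourself, here is a minimal sketch that reproduces the error outside of map(). The value assigned to x is just an illustrative date standing in for one element of the series; the variable names are my own.

```python
import pandas as pd

# One element of the series, as map() would hand it to the lambda.
x = pd.Timestamp('2020-01-04')

# x.min is a class-level constant (itself a Timestamp), not a method.
is_timestamp = isinstance(x.min, pd.Timestamp)
is_callable = callable(x.min)

# Calling it therefore raises the TypeError from the question.
try:
    x.min()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(is_timestamp, is_callable)
print(error_message)
```

So x.min() is really "take the constant, then try to call it", which is why the message mentions a Timestamp object rather than a missing method.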
Anyway. In the future you should probably include some test data, but I think I managed to find a suitable example:
The following should do it:
import pandas as pd

df = pd.DataFrame(
    {
        'learners_id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'anticipatedstartdate': [
            '2020-01-04',
            '2020-01-05',
            '2020-01-03',
            '2020-01-07',
            '2020-01-01',
            '2020-01-08',
            '2020-01-09',
            '2020-01-06',
            '2020-01-02',
        ]
    }
)
# Parse the strings into Timestamps so that min compares dates, not text
df['anticipatedstartdate'] = pd.to_datetime(df['anticipatedstartdate'])
# 1 where the row holds its learner's earliest date, 0 otherwise
df['min_asd'] = 1 * (
    df.groupby('learners_id')['anticipatedstartdate'].transform('min') == df['anticipatedstartdate']
)
This outputs:
learners_id anticipatedstartdate min_asd
0 1 2020-01-04 1
1 2 2020-01-05 0
2 3 2020-01-03 0
3 1 2020-01-07 0
4 2 2020-01-01 1
5 3 2020-01-08 0
6 1 2020-01-09 0
7 2 2020-01-06 0
8 3 2020-01-02 1
Normally you would use an aggregation function directly on the result of the groupby() function and the series selector (a projection, if you will), like this:
df.groupby('learners_id')['anticipatedstartdate'].min()
learners_id
1 2020-01-04
2 2020-01-01
3 2020-01-02
However, this collapses each group to a single row, whereas you want one result per entry so that you keep the granularity of your original dataset. This can be done using the .transform() function instead:
df.groupby('learners_id')['anticipatedstartdate'].transform('min')
0 2020-01-04
1 2020-01-01
2 2020-01-02
3 2020-01-04
4 2020-01-01
5 2020-01-02
6 2020-01-04
7 2020-01-01
8 2020-01-02
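If it helps to see the difference between the two side by side, the following sketch compares them on the same example data. The variable names (grouped, aggregated, broadcast, mapped) are my own, and the manual map() step is only there to illustrate what transform is doing for you, not something you need in practice.

```python
import pandas as pd

df = pd.DataFrame({
    'learners_id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'anticipatedstartdate': pd.to_datetime([
        '2020-01-04', '2020-01-05', '2020-01-03',
        '2020-01-07', '2020-01-01', '2020-01-08',
        '2020-01-09', '2020-01-06', '2020-01-02',
    ]),
})

grouped = df.groupby('learners_id')['anticipatedstartdate']

# Aggregation: one row per learners_id, indexed by learners_id.
aggregated = grouped.min()

# Transform: one row per original row, indexed like df.
broadcast = grouped.transform('min')

# Looking up each row's learners_id in the aggregated result
# gives the same values that transform produces directly.
mapped = df['learners_id'].map(aggregated)

print(len(aggregated), len(broadcast))
print((mapped == broadcast).all())
```

Because the transformed series shares the index of df, it can be compared with or assigned to columns of df without any merge.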
Now, because you have the minimum value of anticipatedstartdate for each of your original records, you can simply do an equality check against the original anticipatedstartdate series:
df.groupby('learners_id')['anticipatedstartdate'].transform('min') == df['anticipatedstartdate']
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
This is the result we are looking for, but it seemed like you wanted them as {0, 1} integers. It's easy to coerce bools to those by multiplying by 1, giving you:
df['min_asd'] = 1 * (
    df.groupby('learners_id')['anticipatedstartdate'].transform('min') == df['anticipatedstartdate']
)
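As a small aside, multiplying by 1 is not the only way to get integers. The sketch below, on the same example data, shows that .astype(int) should give the same flags; the names is_min, via_multiply and via_astype are my own.

```python
import pandas as pd

df = pd.DataFrame({
    'learners_id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'anticipatedstartdate': pd.to_datetime([
        '2020-01-04', '2020-01-05', '2020-01-03',
        '2020-01-07', '2020-01-01', '2020-01-08',
        '2020-01-09', '2020-01-06', '2020-01-02',
    ]),
})

# Boolean mask: True where the row holds its learner's earliest date.
is_min = (
    df.groupby('learners_id')['anticipatedstartdate'].transform('min')
    == df['anticipatedstartdate']
)

via_multiply = 1 * is_min        # the approach used above
via_astype = is_min.astype(int)  # a more explicit spelling

print(via_multiply.tolist())
print(via_astype.tolist())
```

Either way, note that if a learner has the same earliest date on several rows, every one of those rows is flagged with 1.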
Upvotes: 1