Reputation: 489
I have a df that looks something like this:
A B C outcome time people_id
. . . 0 34 'ID_4'
. . . 1 23 'ID_2'
. . . 0 2 'ID_1'
. . . 1 85 'ID_4'
I am trying to create a new variable that represents the most recent result for each ID, but I am running into problems because I am not very familiar with pandas. My current attempt looks something like this, but I keep hitting errors as I tinker. What is a better way to do this?
def recent_train(x):
    _df = train[(train.people_id == x.people_id.values[0]) & (train.time < x.time.values[0])]
    min_time = _df.time.min()
    avg = _df[_df.time == min_time].outcome.mean()
    return avg

train['recent'] = train.apply(lambda x: recent_train(x), axis = 1)
I am using the mean because some of the values might be mixed, so I want to capture the percentage that are 1.
Upvotes: 0
Views: 55
Reputation: 20891
This should do it if I'm understanding what you want correctly:
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E'],
                   'outcome': [0, 1, 0, 1, 1],
                   'time': [34, 34, 2, 85, 34],
                   'people_id': ['ID_4', 'ID_2', 'ID_1', 'ID_4', 'ID_4']},
                  columns=['a', 'outcome', 'time', 'people_id'])

# Mean outcome for each (people_id, time) pair, as a multi-indexed Series
mean_outcomes_by_id_and_time = df.groupby(['people_id', 'time'])['outcome'].mean()
# First row per people_id, i.e. the lowest time, because groupby() sorts its keys
most_recent_mean_outcomes_by_id = mean_outcomes_by_id_and_time.groupby(level=[0]).nth(0)

print(df)
print(mean_outcomes_by_id_and_time)
print(most_recent_mean_outcomes_by_id)
output:
   a  outcome  time people_id
0  A        0    34      ID_4
1  B        1    34      ID_2
2  C        0     2      ID_1
3  D        1    85      ID_4
4  E        1    34      ID_4
people_id  time
ID_1       2       0.0
ID_2       34      1.0
ID_4       34      0.5
           85      1.0
Name: outcome, dtype: float64
people_id
ID_1    0.0
ID_2    1.0
ID_4    0.5
Name: outcome, dtype: float64
Steps:
1. Compute the mean outcome for each people_id and time (in the form of a multi-indexed series).
2. Group by people_id and then get the first row in each group, which corresponds to the row with the lowest time value for each people_id, since groupby() automatically sorts.

You could do it in one line instead of two if you want. I broke it into two for clarity.
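For reference, here is a minimal sketch of the chained version, with the result mapped back onto each row as the recent column from the question. The name earliest_mean_by_id is made up for illustration, and .first() stands in for .nth(0) on the assumption that outcome has no missing values:

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E'],
    'outcome': [0, 1, 0, 1, 1],
    'time': [34, 34, 2, 85, 34],
    'people_id': ['ID_4', 'ID_2', 'ID_1', 'ID_4', 'ID_4'],
})

# Both steps chained: mean outcome per (people_id, time), then the
# first (lowest-time) entry per people_id. The variable name is hypothetical.
earliest_mean_by_id = (
    df.groupby(['people_id', 'time'])['outcome'].mean()
      .groupby(level='people_id').first()
)

# Look up each row's people_id in the per-ID Series to fill the new column.
df['recent'] = df['people_id'].map(earliest_mean_by_id)

print(df)
```

If you want the highest time per ID rather than the lowest, swapping .first() for .last() should do it, since the grouped index is sorted by time within each people_id.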
Upvotes: 1