J_Heads

Reputation: 489

Create a new variable to represent the last result in a time series with pandas

I have a df that looks something like this:

A  B  C  outcome time people_id
.  .  .    0     34  'ID_4'
.  .  .    1     23  'ID_2'
.  .  .    0     2   'ID_1'
.  .  .    1     85  'ID_4'

I am trying to create a new variable that represents the most recent result for each ID, but I keep running into problems because I am not very familiar with pandas. My current attempt looks something like this, but it keeps breaking as I tinker with it. What is a better way to do this?

def recent_train(x):
    _df = train[(train.people_id == x.people_id.values[0]) & (train.time < x.time.values[0])]
    min_time = _df.time.min()
    avg = _df[_df.time == min_time].outcome.mean()
    return avg

train['recent'] = train.apply(lambda x: recent_train(x), axis = 1)

I am using the mean because an ID can have several rows at the same time with mixed outcomes, so I want to capture the fraction that are 1.

Upvotes: 0

Views: 55

Answers (1)

MarredCheese

Reputation: 20891

This should do it if I'm understanding what you want correctly:

import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E'],
                   'outcome': [0, 1, 0, 1, 1],
                   'time': [34, 34, 2, 85, 34],
                   'people_id': ['ID_4', 'ID_2', 'ID_1', 'ID_4', 'ID_4']},
                   columns=['a', 'outcome', 'time', 'people_id'])

mean_outcomes_by_id_and_time = df.groupby(['people_id', 'time'])['outcome'].mean()
most_recent_mean_outcomes_by_id = mean_outcomes_by_id_and_time.groupby(level=[0]).nth(0)

print(df)
print(mean_outcomes_by_id_and_time)
print(most_recent_mean_outcomes_by_id)

Output:

   a  outcome  time people_id
0  A        0    34      ID_4
1  B        1    34      ID_2
2  C        0     2      ID_1
3  D        1    85      ID_4
4  E        1    34      ID_4

people_id  time
ID_1       2       0.0
ID_2       34      1.0
ID_4       34      0.5
           85      1.0
Name: outcome, dtype: float64

people_id
ID_1    0.0
ID_2    1.0
ID_4    0.5
Name: outcome, dtype: float64

Steps:

  1. Get the mean outcome for each people_id and time (in the form of a multi-indexed series).
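  2. Group by people_id and take the first row in each group. Because groupby() sorts by the group keys by default (sort=True), the result of step 1 is ordered by people_id and then by time, so the first row in each group is the one with the lowest time value for that people_id.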
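
If "most recent" should mean the highest time value rather than the lowest, the same two steps apply with the last row of each group taken instead of the first. Here is a minimal sketch, assuming the same sample df as above. The names means, earliest and latest are illustrative only, and .first()/.last() are swapped in here in place of .nth(0):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E'],
                   'outcome': [0, 1, 0, 1, 1],
                   'time': [34, 34, 2, 85, 34],
                   'people_id': ['ID_4', 'ID_2', 'ID_1', 'ID_4', 'ID_4']})

# Step 1: mean outcome per (people_id, time); the result is sorted by both keys.
means = df.groupby(['people_id', 'time'])['outcome'].mean()

# Step 2: within each people_id, the first row has the lowest time
# and the last row has the highest time.
earliest = means.groupby(level='people_id').first()
latest = means.groupby(level='people_id').last()
```

With the sample data, the two results differ only for ID_4, which has rows at both time 34 and time 85.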

You could do it in one line instead of two if you want. I broke it into two for clarity.
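
For reference, here is a sketch of the one-line form, followed by one way to attach the per-ID result to the original rows as a new column, as the question asks. The column name recent comes from the question, and recent_by_id is an illustrative name. Doing the lookup with Series.map is an assumption, not part of the two-step version. .first() is used instead of .nth(0) so that the result is indexed by people_id alone, which the lookup needs; newer pandas versions may keep the full (people_id, time) index with .nth(0).

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E'],
                   'outcome': [0, 1, 0, 1, 1],
                   'time': [34, 34, 2, 85, 34],
                   'people_id': ['ID_4', 'ID_2', 'ID_1', 'ID_4', 'ID_4']})

# Both steps chained: mean per (people_id, time), then the first
# (lowest-time) mean for each people_id.
recent_by_id = (df.groupby(['people_id', 'time'])['outcome'].mean()
                  .groupby(level='people_id').first())

# Look up each row's people_id in the per-ID series to build the new column.
df['recent'] = df['people_id'].map(recent_by_id)
```

Every row for a given people_id gets the same value here. If each row should only see results from strictly earlier times, as the filter in the question's function suggests, a different approach would be needed.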

Upvotes: 1
