Reputation: 517
I have a Pandas DataFrame with the following format:
In [0]: df
Out[0]:
col1 col2 date
0 1 1 2015-01-01
1 1 2 2015-01-09
2 1 3 2015-01-10
3 2 1 2015-02-10
4 2 2 2015-02-10
5 2 3 2015-02-25
In [1]: df.dtypes
Out[1]:
col1 int64
col2 int64
date datetime64[ns]
dtype: object
We want to find the value for col2
corresponding to the greatest difference in date (between consecutive elements in the sorted-by-dates groups), grouped by col1
. Assume there are no groups of size 1.
Desired Output
In [2]: output
Out[2]:
col1 col2
1 1 # This is because the difference between 2015-01-09 and 2015-01-01 is the greatest
2 2 # This is because the difference between 2015-02-25 and 2015-02-10 is the greatest
The real df
has many values for col1
that we need to groupby to do calculations. Is this possible by applying a function to the following? Please note, the dates are already in ascending order.
gb = df.groupby(col1)
gb.apply(right_maximum_date_difference)
Upvotes: 3
Views: 1905
Reputation: 8906
I would try a slightly differnt tack: Pivot the table so that you have a column for each value in col2
containing the dates and the values of col1
as the index. Then you can use the .diff
method to get the differences between consecutive cells. This might not work if there are duplicate col1
, col2
pairs though, which is not clear from the question.
df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2],
'col2': [1, 2, 3, 1, 2, 3],
'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10',
'2015-02-10', '2015-02-10', '2015-02-25'])})
p = df.pivot(columns='col1', index='col2', values='date')
p
col1 1 2
col2
1 2015-01-01 2015-02-10
2 2015-01-09 2015-02-10
3 2015-01-10 2015-02-25
p.diff().shift(-1).idxmax()
col1
1 1
2 2
The .shift(-1)
takes care of the fact that you want the first of the two consecutive dates with the largest difference.
Upvotes: 3
Reputation: 76297
Here's something that's almost your dataframe (I avoided copying the dates):
df = pd.DataFrame({
'col1': [1, 1, 1, 2, 2, 2],
'col2': [1, 2, 3, 1, 2, 3],
'date': [1, 9, 10, 10, 10, 25]
})
With this, define:
def max_diff_date(g):
g = g.sort(columns=['date'])
return g.col2.ix[(g.date.ix[1: ] - g.date.shift(1)).argmax() - 1]
and you have:
>> df.groupby(df.col1).apply(max_diff_date)
col1
1 1
2 2
dtype: int64
Upvotes: 2