Olivier Bilodeau
Olivier Bilodeau

Reputation: 23

python pandas: Keeping earliest data in new column when grouping data

I'm trying to group by a column, find the minimum date value in that group and insert it in a new column for all values in that group.

The following:

d = {'one' : pd.Series(np.random.randn(6), index=pd.date_range('1/1/2011', periods=6, freq='H')),
     'two' : pd.Series(["A", "B", "C"] * 2, index=pd.date_range('1/1/2011', periods=6, freq='H'))}
df = pd.DataFrame(d)
df['ts'] = df.index
df['min_date'] = df.groupby('two')['ts'].min()
df

gives me this output:

Out[7]: 
                          one two                  ts min_date
2011-01-01 00:00:00  1.676829   A 2011-01-01 00:00:00      NaT
2011-01-01 01:00:00 -0.490976   B 2011-01-01 01:00:00      NaT
2011-01-01 02:00:00 -1.934902   C 2011-01-01 02:00:00      NaT
2011-01-01 03:00:00 -0.625931   A 2011-01-01 03:00:00      NaT
2011-01-01 04:00:00  1.534645   B 2011-01-01 04:00:00      NaT
2011-01-01 05:00:00  0.123045   C 2011-01-01 05:00:00      NaT

[6 rows x 4 columns]

Where I would like to have:

Out[7]: 
                          one two                  ts                 min_date
2011-01-01 00:00:00  1.676829   A 2011-01-01 00:00:00      2011-01-01 00:00:00
2011-01-01 01:00:00 -0.490976   B 2011-01-01 01:00:00      2011-01-01 01:00:00
2011-01-01 02:00:00 -1.934902   C 2011-01-01 02:00:00      2011-01-01 02:00:00
2011-01-01 03:00:00 -0.625931   A 2011-01-01 03:00:00      2011-01-01 00:00:00
2011-01-01 04:00:00  1.534645   B 2011-01-01 04:00:00      2011-01-01 01:00:00
2011-01-01 05:00:00  0.123045   C 2011-01-01 05:00:00      2011-01-01 02:00:00

[6 rows x 4 columns]

Grouping on column two so the first occurence of A is set in min_date for all A entries, etc.

Upvotes: 2

Views: 272

Answers (1)

DSM
DSM

Reputation: 353119

I think you want the transform method:

>>> df['min_date'] = df.groupby('two')['ts'].transform("min")
>>> df
                          one two                  ts            min_date
2011-01-01 00:00:00  0.574285   A 2011-01-01 00:00:00 2011-01-01 00:00:00
2011-01-01 01:00:00 -0.200439   B 2011-01-01 01:00:00 2011-01-01 01:00:00
2011-01-01 02:00:00  0.549725   C 2011-01-01 02:00:00 2011-01-01 02:00:00
2011-01-01 03:00:00  1.187299   A 2011-01-01 03:00:00 2011-01-01 00:00:00
2011-01-01 04:00:00  0.770180   B 2011-01-01 04:00:00 2011-01-01 01:00:00
2011-01-01 05:00:00 -0.448781   C 2011-01-01 05:00:00 2011-01-01 02:00:00

[6 rows x 4 columns]

It's for performing an aggregation operation and then broadcasting the result up to the entire group.

Upvotes: 2

Related Questions