Reputation: 12529
I have a large table that looks like the following:
+---+---------+----------+-------+---------+------------+
| | cust_id | order_id | quant | revenue | date |
+---+---------+----------+-------+---------+------------+
| 0 | 103502 | 107801 | 1 | 246.5 | 2014-07-15 |
| 1 | 110301 | 108001 | 2 | 106.0 | 2014-07-15 |
| 2 | 110301 | 108101 | 2 | 106.0 | 2014-07-15 |
| 3 | 111401 | 108201 | 2 | 408.5 | 2014-07-15 |
| 4 | 111502 | 108301 | 1 | 196.0 | 2014-07-15 |
+---+---------+----------+-------+---------+------------+
For customers that purchase more than once (same cust_id), I am trying to group the data together to find the earliest and latest date for each. For example, say cust_id 123456 purchases on 2014-06-30 and on 2013-06-30. For that customer I would like an output of 365 days.
Is this something I would use groupby for and then apply a function to? I tried doing a groupby on the cust_id and date columns, but I got back an object, so I'm not sure whether I set it up properly.
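For reference, plain Python date subtraction gives that 365-day figure, which is the per-customer number I'm after:
import datetime

d1 = datetime.date(2013, 6, 30)
d2 = datetime.date(2014, 6, 30)
print((d2 - d1).days)  # prints 365 -- the gap I want for each cust_id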
Upvotes: 0
Views: 115
Reputation: 159
If you are getting these data from a database table named t1, the following query will give the date difference for each customer. No self-join is needed, since max and min can be taken over the same group:
select cust_id, max(date) - min(date) from t1 group by cust_id;
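How the subtraction comes out depends on the SQL dialect. As a minimal sketch, here is the same aggregate run against an in-memory SQLite database (the table name t1 and the sample rows are assumptions for illustration); SQLite stores dates as text, so julianday() is needed before the values can be subtracted:
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute("create table t1 (cust_id integer, date text)")
conn.executemany("insert into t1 values (?, ?)",
                 [(123456, '2013-06-30'), (123456, '2014-06-30')])

# julianday() converts the text dates to day numbers so they can be
# subtracted; engines such as PostgreSQL can subtract date columns directly.
rows = conn.execute("""
    select cust_id, julianday(max(date)) - julianday(min(date)) as days_between
    from t1
    group by cust_id
""").fetchall()
print(rows)  # [(123456, 365.0)]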
Upvotes: 0
Reputation: 365915
OK, let's start off with what groupby does.
(I changed your data to give you different dates, because it's not very interesting if they're all the same.)
>>> import datetime
>>> import pandas as pd
>>> d = {'cust_id': (103502, 110301, 110301, 111401, 111502),
...      'order_id': (107801, 108001, 108101, 108201, 108301),
...      'quant': (1, 2, 2, 2, 1),
...      'revenue': (246.5, 106.0, 106.0, 408.5, 196.0),
...      'date': [datetime.datetime.strptime(dt, '%Y-%m-%d') for dt in
...               ('2014-07-15', '2014-07-16', '2014-07-17', '2014-07-18', '2014-07-19')]}
>>> df = pd.DataFrame(d)
>>> df
cust_id date order_id quant revenue
0 103502 2014-07-15 107801 1 246.5
1 110301 2014-07-16 108001 2 106.0
2 110301 2014-07-17 108101 2 106.0
3 111401 2014-07-18 108201 2 408.5
4 111502 2014-07-19 108301 1 196.0
>>> gb = df.groupby('cust_id')
>>> gb.groups
{111401: [3], 111502: [4], 110301: [1, 2], 103502: [0]}
This is pretty clear: each cust_id has a group of associated rows.
So, what can we do with this? Well, just as you can access a DataFrame's columns to get a Series, you can access a DataFrameGroupBy's columns to get a SeriesGroupBy. And then you can call aggregate methods on those columns, which give you a single value for each group:
>>> gb['date'].max()
cust_id
103502 2014-07-15
110301 2014-07-17
111401 2014-07-18
111502 2014-07-19
Name: date, dtype: datetime64[ns]
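As an aside, if you want both ends at once, agg can apply several aggregations in a single pass; this sketch just reuses the same gb:
>>> gb['date'].agg(['min', 'max'])
               min        max
cust_id
103502  2014-07-15 2014-07-15
110301  2014-07-16 2014-07-17
111401  2014-07-18 2014-07-18
111502  2014-07-19 2014-07-19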
So, what (I think) you want is:
>>> gb['date'].max() - gb['date'].min()
cust_id
103502 0 days
110301 1 days
111401 0 days
111502 0 days
Name: date, dtype: timedelta64[ns]
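And if you'd rather have plain integer day counts than timedelta64 values, the dt accessor converts the result:
>>> (gb['date'].max() - gb['date'].min()).dt.days
cust_id
103502    0
110301    1
111401    0
111502    0
Name: date, dtype: int64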
Upvotes: 2