ds_user
ds_user

Reputation: 2179

pandas grouping data frame - python

I have this dataset,

PRODUCT_ID   SALE_DATE   SALE_PRICE PROVIDER
1            01/02/16    25         1
1            02/10/16    60         1
1            01/11/16    63         2
1            09/10/16    65         3
2            11/11/15    54         1
2            13/01/16    34         2
3            19/05/14    45         1
3            15/10/15    38         1
3            16/06/14    53         2
3            18/10/15    58         2

This is a combined dataset, data provided by different providers, there is no common identifier for each sale. The issue here is each data provider will have slightly different date and price for each sale. So I am trying to group them together as single sale with a group id. So the business logic here is data provider 1 is the first to get sale data, so for a product id, if the sale date from provider 2 or 3 is within 1 month time and the price is within 10$ difference (more or less), we consider them as same sale, else it will be considered as different sale. So, the output should look like,

PRODUCT_ID   SALE_DATE   SALE_PRICE PROVIDER  SALE_GROUP_ID
1            01/02/16    25         1          1
1            02/10/16    60         1          2
1            01/11/16    63         2          2
1            09/10/16    65         3          2
2            11/11/15    54         1          3 
2            13/01/16    34         2          4
3            19/05/14    45         1          5
3            15/10/15    38         1          6
3            16/06/14    53         2          5
3            18/10/15    58         2          7

How do I achieve this in pandas, can someone help please? Thanks.

Upvotes: 1

Views: 91

Answers (1)

BENY
BENY

Reputation: 323226

We can using groupby after sort_values

df.SALE_DATE=pd.to_datetime(df.SALE_DATE,format='%d/%m/%y')
df=df.sort_values(['PRODUCT_ID','SALE_DATE','SALE_PRICE'])
df1=df.copy()
df1=df.copy().sort_index()


df1['SALE_GROUP_ID']=df.groupby([df['PRODUCT_ID'],(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum(),df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()]).ngroup().add(1).sort_index().values
df1
Out[294]: 
   PRODUCT_ID  SALE_DATE  SALE_PRICE  PROVIDER  SALE_GROUP_ID
0           1 2016-02-01          25         1              1
1           1 2016-10-02          60         1              2
2           1 2016-11-01          63         2              2
3           1 2016-10-09          65         3              2
4           2 2015-11-11          54         1              3
5           2 2016-01-13          34         2              4
6           3 2014-05-19          45         1              5
7           3 2015-10-15          38         1              6
8           3 2014-06-16          53         2              5
9           3 2015-10-18          58         2              7

More info: you require 3 condition when get the group ,

 1. PRODUCT_ID
 2. Date different within in 1 month
 3. SALE_PRICE different less than 10

(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum(), is to getthing the group of date within one month

(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum()
Out[274]: 
0    1
1    2
3    2
2    2
4    3
5    4
6    5
8    5
7    6
9    6
Name: SALE_DATE, dtype: int32

df['SALE_PRICE'].diff().lt(10).ne(1).cumsum() is to get the value within 10

df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()
Out[275]: 
0    1
1    2
3    2
2    2
4    2
5    2
6    3
8    3
7    3
9    4
Name: SALE_PRICE, dtype: int32

After the groupby we just combine (getting the intersection of each group condition), then we using ngroup to get the group Id

Upvotes: 1

Related Questions