Reputation: 2179
I have this dataset,
PRODUCT_ID SALE_DATE SALE_PRICE PROVIDER
1 01/02/16 25 1
1 02/10/16 60 1
1 01/11/16 63 2
1 09/10/16 65 3
2 11/11/15 54 1
2 13/01/16 34 2
3 19/05/14 45 1
3 15/10/15 38 1
3 16/06/14 53 2
3 18/10/15 58 2
This is a combined dataset, data provided by different providers, there is no common identifier for each sale. The issue here is each data provider will have slightly different date and price for each sale. So I am trying to group them together as single sale with a group id. So the business logic here is data provider 1 is the first to get sale data, so for a product id, if the sale date from provider 2 or 3 is within 1 month time and the price is within 10$ difference (more or less), we consider them as same sale, else it will be considered as different sale. So, the output should look like,
PRODUCT_ID SALE_DATE SALE_PRICE PROVIDER SALE_GROUP_ID
1 01/02/16 25 1 1
1 02/10/16 60 1 2
1 01/11/16 63 2 2
1 09/10/16 65 3 2
2 11/11/15 54 1 3
2 13/01/16 34 2 4
3 19/05/14 45 1 5
3 15/10/15 38 1 6
3 16/06/14 53 2 5
3 18/10/15 58 2 7
How do I achieve this in pandas, can someone help please? Thanks.
Upvotes: 1
Views: 91
Reputation: 323226
We can using groupby
after sort_values
df.SALE_DATE=pd.to_datetime(df.SALE_DATE,format='%d/%m/%y')
df=df.sort_values(['PRODUCT_ID','SALE_DATE','SALE_PRICE'])
df1=df.copy()
df1=df.copy().sort_index()
df1['SALE_GROUP_ID']=df.groupby([df['PRODUCT_ID'],(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum(),df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()]).ngroup().add(1).sort_index().values
df1
Out[294]:
PRODUCT_ID SALE_DATE SALE_PRICE PROVIDER SALE_GROUP_ID
0 1 2016-02-01 25 1 1
1 1 2016-10-02 60 1 2
2 1 2016-11-01 63 2 2
3 1 2016-10-09 65 3 2
4 2 2015-11-11 54 1 3
5 2 2016-01-13 34 2 4
6 3 2014-05-19 45 1 5
7 3 2015-10-15 38 1 6
8 3 2014-06-16 53 2 5
9 3 2015-10-18 58 2 7
More info: you require 3 condition when get the group ,
1. PRODUCT_ID
2. Date different within in 1 month
3. SALE_PRICE different less than 10
(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum()
, is to getthing the group of date within one month
(df['SALE_DATE'].diff().abs()/np.timedelta64(1, 'M')).lt(1).ne(1).cumsum()
Out[274]:
0 1
1 2
3 2
2 2
4 3
5 4
6 5
8 5
7 6
9 6
Name: SALE_DATE, dtype: int32
df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()
is to get the value within 10
df['SALE_PRICE'].diff().lt(10).ne(1).cumsum()
Out[275]:
0 1
1 2
3 2
2 2
4 2
5 2
6 3
8 3
7 3
9 4
Name: SALE_PRICE, dtype: int32
After the groupby
we just combine (getting the intersection of each group condition), then we using ngroup
to get the group Id
Upvotes: 1