Reputation: 923
I am trying to create a new column Group (Cluster ) using the below logic.
LOgic :the script shud check the vendor,text field and the days column if the vendor,text and days value <=2 then the records shud be grouped into 1 cluster
My code
data['Date']=pd.to_datetime(data['Date'],infer_datetime_format=True)
data['Days']=(data['Date'].diff(1).dt.days).fillna(0)
data['Text']=data['Text'].fillna('No Value')
data['Vendor']=data['Vendor'].fillna('No Value')
diff= lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Date_Difference.transform(diff)
g = data.groupby(['Text', 'Vendor', t], sort=False).ngroup()
data=data.assign(Group=g.add(1).astype(str).radd('Cluster'))
My current Output
Invoice Date Text Vendor Days Group
1234567 1/1/2012 Repairs A 0 Cluster1
1234568 2/1/2012 Repairs A 1 Cluster1
1234569 4/1/2012 Repairs A 2 Cluster1
1234570 6/1/2012 Water A 2 Cluster2
1234571 9/1/2012 Water A 3 Cluster2
1234572 9/1/2012 Car A 0 Cluster3
1234573 9/1/2012 Bus A 0 Cluster4
1234574 9/1/2012 Bike A 0 Cluster5
1234575 9/1/2012 Repairs A 0 Cluster6
1234576 10/1/2012 Repairs A 1 Cluster6
1234577 11/1/2012 Repairs A 1 Cluster6
1234578 12/1/2012 Water A 2 Cluster6
1234579 13/1/2012 Water A 1 Cluster2
1234580 14/1/2012 Water A 1 Cluster2
Expected Output
Invoice Date Text Vendor Days Group
1234567 1/1/2012 Repairs A 0 Cluster1
1234568 2/1/2012 Repairs A 1 Cluster1
1234569 4/1/2012 Repairs A 2 Cluster1
1234570 6/1/2012 Water A 2 Cluster2
1234571 9/1/2012 Water A 3 Cluster2
1234572 9/1/2012 Car A 0 No Cluster
1234573 9/1/2012 Bus A 0 No Cluster
1234574 9/1/2012 Bike A 0 No Cluster
1234575 9/1/2012 Repairs A 0 Cluster3
1234576 10/1/2012 Repairs A 1 Cluster3
1234577 11/1/2012 Repairs A 1 Cluster3
1234578 12/1/2012 Water A 2 Cluster4
1234579 13/1/2012 Water A 1 Cluster4
1234580 14/1/2012 Water A 1 Cluster4
Testing Data
Invoice Date Text Vendor Days Group Expected Group
1000001 1/1/2012 Repair A 0 Cluster1 Cluster1
1000003 2/1/2012 Repair A 1 Cluster1 Cluster1
1000005 4/1/2012 Repair A 2 Cluster1 Cluster1
1000007 6/1/2012 Water A 2 No Cluster No Cluster
1000008 9/2/2012 Repair A 34 Cluster2 No Cluster
1000010 9/2/2012 Garden A 0 Cluster3 Cluster2
1000011 10/2/2012 Garden A 1 Cluster3 Cluster2
1000012 15/2/2012 Car A 5 Cluster4 Cluster3
1000013 16/2/2012 Car A 1 Cluster4 Cluster3
1000015 17/2/2012 Car A 1 Cluster4 Cluster3
1234574 17/2/2012 Bike A 0 No Cluster No Cluster
How can this be done in python?
Upvotes: 0
Views: 103
Reputation: 863331
Idea is create new Series g1
for consecutive groups by Text
and Vendor
columns and fill only duplicated by helper Series
g1
, last add non matched value by Series.reindex
:
data['Date']=pd.to_datetime(data['Date'],infer_datetime_format=True, dayfirst=True)
data.sort_values(['Vendor','Date'],inplace=True)
data['Date_Difference']=(data['Date'].diff(1).dt.days).fillna(0)
data['Text']=data['Text'].fillna('No Value')
data['Vendor']=data['Vendor'].fillna('No Value')
diff= lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Date_Difference.transform(diff)
g1 = data[['Text', 'Vendor']].ne(data[['Text', 'Vendor']].shift()).any(axis=1).cumsum()
m = g1.duplicated(keep=False)
g = data[m].groupby([g1, t], sort=False).ngroup()
clust = g.add(1).astype(str).radd('Cluster').reindex(data.index, fill_value='No Cluster')
data=data.assign(Group=clust)
print (data)
Invoice Date Text Vendor Days Group Date_Difference
0 1000001 2012-01-01 Repair A 0 Cluster1 0.0
1 1000003 2012-01-02 Repair A 1 Cluster1 1.0
2 1000005 2012-01-04 Repair A 2 Cluster1 2.0
3 1000007 2012-01-06 Water A 2 No Cluster 2.0
4 1000008 2012-02-09 Repair A 34 No Cluster 34.0
5 1000010 2012-02-09 Garden A 0 Cluster2 0.0
6 1000011 2012-02-10 Garden A 1 Cluster2 1.0
7 1000012 2012-02-15 Car A 5 Cluster3 5.0
8 1000013 2012-02-16 Car A 1 Cluster3 1.0
9 1000015 2012-02-17 Car A 1 Cluster3 1.0
10 1234574 2012-02-17 Bike A 0 No Cluster 0.0
Upvotes: 1
Reputation: 1
Just add this to your existing code, and you're set:
for i in range(len(data)):
if data.Group.isin(list(data.Group.value_counts()[data.Group.value_counts() == 1].index))[i]:
data.loc[i,'Group'] = 'No Cluster'
Upvotes: 0