aeapen
aeapen

Reputation: 923

How to assign unique column name if the same set pattern repeats again in data frame using Pandas?

I am trying to create a new column Group (Cluster ) using the below logic.

LOgic :the script shud check the vendor,text field and the days column if the vendor,text and days value <=2 then the records shud be grouped into 1 cluster

My code

data['Date']=pd.to_datetime(data['Date'],infer_datetime_format=True)
data['Days']=(data['Date'].diff(1).dt.days).fillna(0)
data['Text']=data['Text'].fillna('No Value')
data['Vendor']=data['Vendor'].fillna('No Value')
diff= lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Date_Difference.transform(diff)
g = data.groupby(['Text', 'Vendor', t], sort=False).ngroup()
data=data.assign(Group=g.add(1).astype(str).radd('Cluster'))

My current Output

 Invoice    Date    Text    Vendor  Days    Group       
1234567 1/1/2012    Repairs A       0        Cluster1   
1234568 2/1/2012    Repairs A       1        Cluster1   
1234569 4/1/2012    Repairs A       2        Cluster1
1234570 6/1/2012    Water   A       2        Cluster2  
1234571 9/1/2012    Water   A       3        Cluster2
1234572 9/1/2012    Car     A       0        Cluster3
1234573 9/1/2012    Bus     A       0        Cluster4
1234574 9/1/2012    Bike    A       0        Cluster5
1234575 9/1/2012    Repairs A       0        Cluster6
1234576 10/1/2012   Repairs A       1        Cluster6
1234577 11/1/2012   Repairs A       1        Cluster6
1234578 12/1/2012   Water   A       2        Cluster6
1234579 13/1/2012   Water   A       1        Cluster2
1234580 14/1/2012   Water   A       1        Cluster2

Expected Output

 Invoice    Date        Text    Vendor  Days    Group
    1234567 1/1/2012    Repairs A       0        Cluster1
    1234568 2/1/2012    Repairs A       1        Cluster1
    1234569 4/1/2012    Repairs A       2        Cluster1
    1234570 6/1/2012    Water   A       2        Cluster2
    1234571 9/1/2012    Water   A       3        Cluster2
    1234572 9/1/2012    Car     A       0        No Cluster
    1234573 9/1/2012    Bus     A       0        No Cluster
    1234574 9/1/2012    Bike    A       0        No Cluster
    1234575 9/1/2012    Repairs A       0        Cluster3
    1234576 10/1/2012   Repairs A       1        Cluster3
    1234577 11/1/2012   Repairs A       1        Cluster3
    1234578 12/1/2012   Water   A       2        Cluster4
    1234579 13/1/2012   Water   A       1        Cluster4
    1234580 14/1/2012   Water   A       1        Cluster4

Testing Data

  Invoice     Date      Text   Vendor   Days    Group   Expected Group
    1000001 1/1/2012    Repair  A       0   Cluster1     Cluster1
    1000003 2/1/2012    Repair  A       1   Cluster1     Cluster1
    1000005 4/1/2012    Repair  A       2   Cluster1     Cluster1
    1000007 6/1/2012    Water   A       2   No Cluster   No Cluster
    1000008 9/2/2012    Repair  A       34  Cluster2     No Cluster
    1000010 9/2/2012    Garden  A       0   Cluster3     Cluster2
    1000011 10/2/2012   Garden  A       1   Cluster3     Cluster2
    1000012 15/2/2012   Car     A       5   Cluster4     Cluster3
    1000013 16/2/2012   Car     A       1   Cluster4     Cluster3
    1000015 17/2/2012   Car     A       1   Cluster4     Cluster3
    1234574 17/2/2012   Bike    A       0   No Cluster   No Cluster

How can this be done in python?

Upvotes: 0

Views: 103

Answers (2)

jezrael
jezrael

Reputation: 863331

Idea is create new Series g1 for consecutive groups by Text and Vendor columns and fill only duplicated by helper Series g1, last add non matched value by Series.reindex:

data['Date']=pd.to_datetime(data['Date'],infer_datetime_format=True, dayfirst=True)
data.sort_values(['Vendor','Date'],inplace=True)
data['Date_Difference']=(data['Date'].diff(1).dt.days).fillna(0)
data['Text']=data['Text'].fillna('No Value')
data['Vendor']=data['Vendor'].fillna('No Value')
diff= lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Date_Difference.transform(diff)

g1 = data[['Text', 'Vendor']].ne(data[['Text', 'Vendor']].shift()).any(axis=1).cumsum()
m = g1.duplicated(keep=False)

g = data[m].groupby([g1, t], sort=False).ngroup()
clust = g.add(1).astype(str).radd('Cluster').reindex(data.index, fill_value='No Cluster')

data=data.assign(Group=clust)
print (data)
    Invoice       Date    Text Vendor  Days       Group  Date_Difference
0   1000001 2012-01-01  Repair      A     0    Cluster1              0.0
1   1000003 2012-01-02  Repair      A     1    Cluster1              1.0
2   1000005 2012-01-04  Repair      A     2    Cluster1              2.0
3   1000007 2012-01-06   Water      A     2  No Cluster              2.0
4   1000008 2012-02-09  Repair      A    34  No Cluster             34.0
5   1000010 2012-02-09  Garden      A     0    Cluster2              0.0
6   1000011 2012-02-10  Garden      A     1    Cluster2              1.0
7   1000012 2012-02-15     Car      A     5    Cluster3              5.0
8   1000013 2012-02-16     Car      A     1    Cluster3              1.0
9   1000015 2012-02-17     Car      A     1    Cluster3              1.0
10  1234574 2012-02-17    Bike      A     0  No Cluster              0.0

Upvotes: 1

Benjamin Lim
Benjamin Lim

Reputation: 1

  1. Count entries per cluster and isolate entries which have no others in their cluster
  2. Set the 'Group' of these entries to 'No Cluster'

Just add this to your existing code, and you're set:

for i in range(len(data)):
    if data.Group.isin(list(data.Group.value_counts()[data.Group.value_counts() == 1].index))[i]:
        data.loc[i,'Group'] = 'No Cluster'

Upvotes: 0

Related Questions