Reputation: 7448
I am working with a dataframe in which each value of one column is a list. I want to derive a new column that assigns a unique integer id to every row whose list has more than one element; rows whose list has a single element get 0. If two lists contain the same elements in a different order, they should be assigned the same id. A sample dataframe looks like this:
document_no_list cluster_id
[1,2,3] 1
[3,2,1] 1
[4,5,6,7] 2
[8] 0
[9,10] 3
[10,9] 3
The cluster_id column only considers the 1st, 2nd, 3rd, 5th and 6th rows, each of which has a list with more than one element, and assigns an integer id to the corresponding cell. [1,2,3] and [3,2,1] should be assigned the same cluster_id, as should [9,10] and [10,9].
I previously asked a similar question, which did not take duplicate lists into account, at
pandas how to derived values for a new column base on another column
I am wondering how to do that in pandas.
Upvotes: 0
Views: 45
Reputation: 3130
First, add a column with the list lengths and another with each list sorted and converted to a tuple (tuples are hashable, which drop_duplicates requires; plain lists are not):
df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))
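As a quick check of this step, the sketch below rebuilds the sample frame from the question and confirms that reordered lists collapse to the same key. The column name `key` and the helper variables are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "document_no_list": [[1, 2, 3], [3, 2, 1], [4, 5, 6, 7], [8], [9, 10], [10, 9]]
})

# Sorting removes the ordering difference; tuple() makes the value hashable.
# "key" is a hypothetical column name chosen for this sketch.
df["key"] = df["document_no_list"].apply(lambda xs: tuple(sorted(xs)))

# Rows 0 and 1 hold the same elements in a different order.
same_key = df["key"].iloc[0] == df["key"].iloc[1]

# Number of distinct keys across all rows, singletons included.
n_distinct = len(set(df["key"]))
```

Note that sorting keeps repeated elements, so [1, 1, 2] and [1, 2] get different keys. If repeats inside a list should be ignored, a frozenset key would be one alternative.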
Then assign a cluster_id to each distinct sorted list that has more than one element:
ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1,len(ids)+1)
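Left join this onto the original dataframe, and fill the rows that were not matched (the singletons) with zeros. Assign the result back, since merge returns a new dataframe, and cast the ids to integers, because the unmatched rows make the column float:
df = df.merge(ids, on='list_sorted', how='left').fillna({'cluster_id': 0}).astype({'cluster_id': int})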
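For comparison, here is a minimal end-to-end sketch that produces the same ids without a merge, using a plain dictionary keyed by the sorted tuple. The function name `assign_cluster_ids` is hypothetical, and the sketch assumes ids should be handed out in order of first appearance, as in the sample output.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "document_no_list": [[1, 2, 3], [3, 2, 1], [4, 5, 6, 7], [8], [9, 10], [10, 9]]
})

def assign_cluster_ids(lists):
    """Return one id per list: 0 for lists with fewer than two elements,
    otherwise a 1-based id shared by lists with the same elements."""
    seen = {}   # maps sorted tuple -> cluster id
    ids = []
    for items in lists:
        if len(items) <= 1:
            ids.append(0)
            continue
        key = tuple(sorted(items))  # order-independent, hashable key
        if key not in seen:
            seen[key] = len(seen) + 1  # next unused id, in order of first appearance
        ids.append(seen[key])
    return ids

df["cluster_id"] = assign_cluster_ids(df["document_no_list"])
print(df)
```

This avoids the helper columns and keeps cluster_id as an integer column from the start, at the cost of a Python-level loop over the rows.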
Upvotes: 1