daiyue

Reputation: 7448

pandas: generate a new column based on values from another column, considering duplicates

I am working with a dataframe in which each value of one column is a list. I want to derive a new column that assigns a unique integer id to each row whose list has more than one element. If two lists contain the same elements in a different order, they should be assigned the same id. A sample dataframe looks like this:

document_no_list    cluster_id
[1,2,3]             1
[3,2,1]             1
[4,5,6,7]           2
[8]                 0
[9,10]              3
[10,9]              3 

The cluster_id column only considers the 1st, 2nd, 3rd, 5th and 6th rows, whose lists each have more than one element, and assigns an integer id to the corresponding cell. [1,2,3] and [3,2,1] should share one cluster_id, as should [9,10] and [10,9].

I asked a similar question earlier, without the requirement to handle duplicate lists, at

pandas how to derived values for a new column base on another column

I am wondering how to do that in pandas.

Upvotes: 0

Views: 45

Answers (1)

Ken Wei

Reputation: 3130

First, add a column with the list lengths, and another column with each list sorted and converted to a tuple. Lists are not hashable, so drop_duplicates and merge would fail on them; tuples work:

df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))

Then assign a cluster_id to each distinct sorted list that has more than one element:

ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1,len(ids)+1)

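Left join this onto the original dataframe, and fill the rows that were not matched (the singletons) with zeros. The fill leaves the column as floats, so cast it back to integers:

df = df.merge(ids, how='left', on='list_sorted').fillna({'cluster_id': 0}).astype({'cluster_id': int})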
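
For reference, here is a minimal end-to-end sketch of the steps above, run against the sample data from the question. It assumes the sorted lists are stored as tuples (so that they are hashable for drop_duplicates and merge), and the name result is only illustrative:

```python
import pandas as pd

# Sample data from the question (only the input column).
df = pd.DataFrame({
    "document_no_list": [[1, 2, 3], [3, 2, 1], [4, 5, 6, 7], [8], [9, 10], [10, 9]]
})

# Helper columns: list length, and an order-independent, hashable key.
df["list_len"] = df["document_no_list"].apply(len)
df["list_sorted"] = df["document_no_list"].apply(lambda x: tuple(sorted(x)))

# One id per distinct key, considering only lists with more than one element.
ids = df.loc[df["list_len"] > 1, ["list_sorted"]].drop_duplicates()
ids["cluster_id"] = range(1, len(ids) + 1)

# Left join the ids back; unmatched rows (singletons) get 0, then cast to int.
result = (
    df.merge(ids, how="left", on="list_sorted")
      .fillna({"cluster_id": 0})
      .astype({"cluster_id": int})
)

print(result[["document_no_list", "cluster_id"]])
```

On the sample data this should give cluster ids 1, 1, 2, 0, 3, 3, matching the table in the question. The helper columns list_len and list_sorted can be dropped afterwards if they are not needed.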

Upvotes: 1
