Reputation: 365

How to get duplicated values in a data frame when the column is a list?

Good morning!

I have a data frame with several columns. One of these columns, data, contains lists. Below is a small example (id is just a column with random information):

df = 
   id  data
0   a  [1, 2, 3]
1   h  [3, 2, 1]
2  bf  [1, 2, 3]

What I want is to get the rows with duplicated values in column data. In this example, I should get rows 0 and 2, because the values in their data column are the same (the list [1, 2, 3]). However, this can't be achieved with df.duplicated(subset = ['data']), because list is an unhashable type.

I know it can be done by taking two rows and comparing their data directly, but my real data frame can have 1000 rows or more, so I can't compare them one by one.

Hope someone knows it!

Thank you very much in advance!

Upvotes: 0

Answers (2)

Adnan Azmat

Reputation: 33

Expanding on Quang's comment:

Try

In [2]: elements = [(1,2,3), (3,2,1), (1,2,3)] 
   ...: df = pd.DataFrame.from_records(elements) 
   ...: df                                                                      
Out[2]: 
   0  1  2
0  1  2  3
1  3  2  1
2  1  2  3

In [3]: # Add a new column of tuples 
   ...: df["new"] = df.apply(lambda x: tuple(x), axis=1) 
   ...: df                                                                      
Out[3]: 
   0  1  2        new
0  1  2  3  (1, 2, 3)
1  3  2  1  (3, 2, 1)
2  1  2  3  (1, 2, 3)

In [4]: # Remove duplicate rows (Keeping the first one) 
   ...: df.drop_duplicates(subset="new", keep="first", inplace=True) 
   ...: df                                                                      
Out[4]: 
   0  1  2        new
0  1  2  3  (1, 2, 3)
1  3  2  1  (3, 2, 1)

In [5]: # Remove the new column if not required 
   ...: df.drop("new", axis=1, inplace=True) 
   ...: df                                                                      
Out[5]: 
   0  1  2
0  1  2  3
1  3  2  1
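
The session above drops the duplicates, whereas the question asks to select them. A minimal sketch of the same tuple idea applied to the question's frame, assuming the list column is named data as in the question (the names mask and dupes are only illustrative):

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {"id": ["a", "h", "bf"], "data": [[1, 2, 3], [3, 2, 1], [1, 2, 3]]}
)

# Tuples are hashable, so Series.duplicated can compare them.
# keep=False marks every member of a duplicate group, not only the repeats.
mask = df["data"].apply(tuple).duplicated(keep=False)

dupes = df[mask]
print(dupes)
```

Because the tuples are only used to build the mask, the original data column keeps its lists and no helper column has to be dropped afterwards.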

Upvotes: 0

ansev

Reputation: 30920

IIUC, we can create a new DataFrame from df['data'] and then check it for duplicates with DataFrame.duplicated.

You can use:

m = pd.DataFrame(df['data'].tolist()).duplicated(keep=False)

df.loc[m]

   id       data
0   a  [1, 2, 3]
2  bf  [1, 2, 3]
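
One caveat worth sketching: the mask m is aligned with df by index label, so the snippet above relies on df having the default 0..n-1 index. A self-contained variant, assuming a hypothetical non-default index (10, 20, 30) purely for illustration, passes index=df.index when expanding the lists:

```python
import pandas as pd

# Example frame from the question, with a hypothetical non-default index.
df = pd.DataFrame(
    {"id": ["a", "h", "bf"], "data": [[1, 2, 3], [3, 2, 1], [1, 2, 3]]},
    index=[10, 20, 30],
)

# Expand each list into its own row of columns, reusing df's index
# so that the resulting boolean mask lines up with df.
expanded = pd.DataFrame(df["data"].tolist(), index=df.index)

# keep=False flags all rows that have at least one identical twin.
m = expanded.duplicated(keep=False)

result = df.loc[m]
print(result)
```

If the lists have different lengths, the expanded frame is padded with missing values, so this approach is probably best suited to lists of equal length.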

Upvotes: 2
