Mr.Muu

Reputation: 107

Detecting duplicates in pandas without the duplicated function

I have the following dataframe:

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]})

df

Using the duplicated method, we can flag the duplicate rows:

df.duplicated()

I was wondering if there is a way to mimic the duplicated method without using it, using basic data structures such as lists, sets, etc.?

Upvotes: 1

Views: 275

Answers (1)

FloLie

Reputation: 1841

Interesting question, though I suspect it is mostly of academic interest.

One possible way, not necessarily the most efficient one, would be:

hashes = [hash(tuple(row.values())) for row in df.to_dict(orient='records')]
[hashes.count(h) > 1 for h in hashes]
Out[67]: [True, True, False, False, False]

For efficient comparison I took the hash value of each row. dict_values, like lists and dicts, is not hashable, so I converted the values to a tuple, which is hashable, and then counted the occurrences of each hash value in the resulting list. Note that counting occurrences flags every member of a duplicate group, which corresponds to df.duplicated(keep=False) rather than the default keep='first'.
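If you want to match the default behavior of df.duplicated() (keep='first', which marks only the later repeats), one sketch is a single pass over the rows with a set of previously seen tuples (variable names here are my own):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]})

seen = set()
flags = []
for row in df.itertuples(index=False):
    key = tuple(row)           # a row as a hashable tuple
    flags.append(key in seen)  # True only for repeats of an earlier row
    seen.add(key)

print(flags)  # [False, True, False, False, False]
```

This is O(n) on average instead of the O(n^2) of list.count, and the result agrees with df.duplicated().tolist().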

Upvotes: 1
