Reputation: 107
I have the following dataframe:
import pandas as pd
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]})
df
Calling the duplicated method flags the duplicate rows:
df.duplicated()
Is there a way to mimic the duplicated method without calling it, using only basic data structures such as lists and sets?
Upvotes: 1
Views: 275
Reputation: 1841
Interesting question, though I guess it is mostly of academic interest.
One possible way, not necessarily the most efficient one, would be:
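# Hash each row's values; equal rows produce equal hashes.
hashes = [hash(tuple(row.values())) for row in df.to_dict(orient='records')]
# Flag every row whose hash occurs more than once.
[hashes.count(h) > 1 for h in hashes]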
Out[67]: [True, True, False, False, False]
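For efficient comparison I took the hash value of each row. A dict_values view does not hash by its contents, and lists and dicts are not hashable at all, so I converted the values to a tuple, which is hashable. I then count the occurrences of each hash value in the resulting list. Note that this flags every copy of a repeated row, which matches df.duplicated(keep=False); the default df.duplicated() leaves the first occurrence unflagged.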
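If the repeated count calls are a concern (each one scans the whole list), a single pass with a set or a Counter should give the same flags. The following is only a sketch, assuming the rows contain hashable values and no NaN (pandas treats NaN values as equal, plain tuples may not); the names rows, first_flags and all_flags are my own, not part of pandas.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]})

# Each row as a plain, hashable tuple of its values (index excluded).
rows = list(df.itertuples(index=False, name=None))

# Like the default keep='first': flag a row only if an equal row came earlier.
seen = set()
first_flags = []
for row in rows:
    first_flags.append(row in seen)
    seen.add(row)

# Like keep=False: flag every row that occurs more than once.
counts = Counter(rows)
all_flags = [counts[row] > 1 for row in rows]

print(first_flags)  # [False, True, False, False, False]
print(all_flags)    # [True, True, False, False, False]
```

Storing the tuples themselves rather than their hash values also avoids a false positive in the unlikely case that two different rows hash to the same number.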
Upvotes: 1