Agnij

Reputation: 581

Python Pandas : Drop Duplicates Function - Unusual Behaviour

The error TypeError: unhashable type: 'list' disappears after I save the data frame and load it again.

Both data frames (the generated one and the saved-and-reloaded one) report the same dtypes.

Reproducible example:

--> import pandas as pd
--> l1 = [[1], [1], [1], [1], [1], [1], [1], [1], [6], [1], [6], [1], [6], [6], [6], [6], [6], [6], [6], [6], [6]]

## len(l1) is 21 ##

--> l2 = ['a']*21
--> l3 = ['c']*10 + ['d']*10 + ['e']
--> df = pd.DataFrame()
--> df['col1'], df['col2'], df['col3'] = l1, l3, l2
--> df
        col1 col2 col3
        0   [1]    c    a
        1   [1]    c    a
        2   [1]    c    a
        3   [1]    c    a
        4   [1]    c    a
        5   [1]    c    a
        6   [1]    c    a
        7   [1]    c    a
        8   [6]    c    a
        9   [1]    c    a
        10  [6]    d    a
        11  [1]    d    a
        12  [6]    d    a
        13  [6]    d    a
        14  [6]    d    a
        15  [6]    d    a
        16  [6]    d    a
        17  [6]    d    a
        18  [6]    d    a
        19  [6]    d    a
        20  [6]    e    a

--> df.dtypes
        col1    object
        col2    object
        col3    object
        dtype: object

--> df.drop_duplicates(subset=['col1', 'col2', 'col3'], keep='last', inplace=True)
        
        ## TypeError: unhashable type: 'list' ##

## If I save it as an Excel file and load it again, the error no longer occurs ##

--> df.to_excel('test.xlsx')
--> df_ = pd.read_excel('test.xlsx')
--> df_.dtypes
        Unnamed: 0     int64
        col1    object
        col2    object
        col3    object
        dtype: object
--> df_.drop_duplicates(subset=['col1', 'col2', 'col3'], keep='last', inplace=True)
--> df_
         Unnamed: 0 col1 col2 col3
        8       8   [6]    c    a
        9       9   [1]    c    a
        11      11  [1]    d    a
        19      19  [6]    d    a
        20      20  [6]    e    a

Does this behaviour have an explanation?

Extended Traceback of Issue

Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4811, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4888, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4863, in f
    vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)
  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 636, in factorize
    values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 484, in _factorize_array
    uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
  File "pandas_libs\hashtable_class_helper.pxi", line 1815, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas_libs\hashtable_class_helper.pxi", line 1731, in pandas._libs.hashtable.PyObjectHashTable._unique

Upvotes: 1

Views: 1317

Answers (2)

mozway

Reputation: 260790

drop_duplicates hashes the values so that it can efficiently keep track of which ones it has already seen.

Lists are not hashable (because they are mutable), so you can't use drop_duplicates on them directly. When you save the data to Excel and load it back, the lists are most likely written out as text and read back as strings such as '[1]', and strings can be hashed.
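
The hashability difference can be checked in plain Python, without pandas. This is only a small sketch: is_hashable is a hypothetical helper written for this illustration, not a pandas or standard-library function.

```python
def is_hashable(value):
    """Return True if hash(value) succeeds, False if it raises TypeError."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


print(is_hashable([1]))    # list  -> False
print(is_hashable((1,)))   # tuple -> True
print(is_hashable('[1]'))  # str   -> True
```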

To overcome the issue, you can convert the lists to tuples, which are hashable:

df['col1'] = df['col1'].apply(tuple)
# now this runs with no error
df.drop_duplicates(subset=['col1', 'col2', 'col3'], keep='last', inplace=True)
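
If col1 should keep its lists, one possible variation is to build the duplicate mask on a temporary copy with tuples and then filter the original frame with it. The sketch below uses a shortened five-row frame for illustration rather than the data from the question; key, mask and deduped are names chosen here.

```python
import pandas as pd

# Shortened illustrative data, not the 21-row frame from the question.
df = pd.DataFrame({
    'col1': [[1], [1], [6], [6], [6]],
    'col2': ['c', 'c', 'c', 'd', 'd'],
    'col3': ['a'] * 5,
})

# Temporary copy whose col1 holds hashable tuples instead of lists.
key = df.assign(col1=df['col1'].apply(tuple))

# True for every row that has a later identical row (keep='last').
mask = key.duplicated(subset=['col1', 'col2', 'col3'], keep='last')

# Filter the original frame, so col1 still contains lists.
deduped = df[~mask]
print(deduped)
```

This leaves df itself unchanged, at the cost of building one extra copy of the frame.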

Upvotes: 3

user7864386

Reputation:

Because even though both columns have dtype object, the items in them are of different types:

>>> df.loc[0,'col1']
[1]


>>> df_.loc[0, 'col1']
'[1]'

Since strings are hashable, you don't see the error that you had before with lists.
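
Since dtypes reports object in both cases, the element types have to be inspected directly. The sketch below builds the two columns by hand as stand-ins for the generated and reloaded data (an assumption, to avoid writing an Excel file) and shows one way to turn the strings back into lists with ast.literal_eval from the standard library.

```python
import ast

import pandas as pd

# Stand-ins for the two columns: real lists vs. their string form.
generated = pd.Series([[1], [6]], dtype=object)
reloaded = pd.Series(['[1]', '[6]'], dtype=object)

# Look at the Python type of each element rather than the column dtype.
print(generated.map(type).unique())
print(reloaded.map(type).unique())

# Safely parse the string representation back into list objects.
restored = reloaded.map(ast.literal_eval)
print(restored.tolist())
```

After this conversion the column holds lists again, so drop_duplicates would raise the same TypeError as before.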

Upvotes: 1
