Reputation: 581
The error TypeError: unhashable type: 'list' disappears after saving the data frame and loading it again.
Both data frames (the generated one and the saved-and-loaded one) have the same dtypes.
Reproducible example:
--> import pandas as pd
--> l1 = [[1], [1], [1], [1], [1], [1], [1], [1], [6], [1], [6], [1], [6], [6], [6], [6], [6], [6], [6], [6], [6]]
## len(l1) is 21 ##
--> l2 = ['a']*21
--> l3 = ['c']*10 + ['d']*10 + ['e']
--> df = pd.DataFrame()
--> df['col1'], df['col2'], df['col3'] = l1, l3, l2
--> df
col1 col2 col3
0 [1] c a
1 [1] c a
2 [1] c a
3 [1] c a
4 [1] c a
5 [1] c a
6 [1] c a
7 [1] c a
8 [6] c a
9 [1] c a
10 [6] d a
11 [1] d a
12 [6] d a
13 [6] d a
14 [6] d a
15 [6] d a
16 [6] d a
17 [6] d a
18 [6] d a
19 [6] d a
20 [6] e a
--> df.dtypes
col1 object
col2 object
col3 object
dtype: object
--> df.drop_duplicates(subset=['col1', 'col2', 'col3'], keep='last', inplace=True)
## TypeError: unhashable type: 'list' ##
## If I save it as an Excel file and load it again, this error does not come up ##
--> df.to_excel('test.xlsx')
--> df_ = pd.read_excel('test.xlsx')
--> df_.dtypes
Unnamed: 0 int64
col1 object
col2 object
col3 object
dtype: object
--> df_.drop_duplicates(subset=['col1', 'col2', 'col3'], keep='last', inplace=True)
--> df_
Unnamed: 0 col1 col2 col3
8 8 [6] c a
9 9 [1] c a
11 11 [1] d a
19 19 [6] d a
20 20 [6] e a
Does this behaviour have an explanation?
Extended Traceback of Issue
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4811, in drop_duplicates
duplicated = self.duplicated(subset, keep=keep)
File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4888, in duplicated
labels, shape = map(list, zip(*map(f, vals)))
File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4863, in f
vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)
File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 636, in factorize
values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 484, in _factorize_array
uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
File "pandas_libs\hashtable_class_helper.pxi", line 1815, in pandas._libs.hashtable.PyObjectHashTable.factorize
File "pandas_libs\hashtable_class_helper.pxi", line 1731, in pandas._libs.hashtable.PyObjectHashTable._unique
Upvotes: 1
Views: 1317
Reputation: 260790
drop_duplicates hashes the objects so that it can efficiently keep track of which ones it has already seen. Lists are not hashable (because they are mutable), so you can't use drop_duplicates on them directly. When you save and load the data, chances are that each list is converted to a string, and strings are hashable.
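As a quick illustration of the hashability point above, the sketch below probes a list, a tuple, and a string with the built-in hash(). The helper name is_hashable is made up for this example and is not part of pandas.

```python
def is_hashable(obj):
    """Return True if Python can compute a hash for obj (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        # Mutable containers such as list raise TypeError here.
        return False
    return True


print(is_hashable([1]))    # a list, as stored in the generated frame
print(is_hashable((1,)))   # the same content as a tuple
print(is_hashable('[1]'))  # the same content as a string
```

Only the list fails the check. Note that a tuple is hashable only if everything inside it is hashable too, so a tuple that contains a list would fail as well.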
To overcome the issue, you can convert the lists to tuples, which are hashable:
df['col1'] = df['col1'].apply(tuple)
# now this runs with no error
df.drop_duplicates(subset=['col1', 'col2', 'col3'], keep='last', inplace=True)
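If you would rather keep the lists in col1 as they are, one possible variation is to use the tuples only as a temporary key. This is a sketch on a small made-up frame (not the data from the question); the names key, mask, and deduped are chosen for illustration.

```python
import pandas as pd

# Small stand-in frame with a list-valued column, shaped like the one in the question.
df = pd.DataFrame({
    'col1': [[1], [1], [6], [6]],
    'col2': ['c', 'c', 'd', 'd'],
    'col3': ['a'] * 4,
})

# Build a hashable copy of col1 without modifying df itself.
key = df.assign(col1=df['col1'].apply(tuple))

# Mark every row that has a later identical row (same idea as keep='last').
mask = key.duplicated(subset=['col1', 'col2', 'col3'], keep='last')

# Filter the original frame, so col1 still holds lists.
deduped = df[~mask]
print(deduped)
```

The trade-off is an extra temporary copy of the frame in exchange for leaving the original column type untouched.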
Upvotes: 3
Reputation:
Even though both columns have dtype object, the items in them are of different types:
>>> df.loc[0,'col1']
[1]
>>> df_.loc[0, 'col1']
'[1]'
Since strings are hashable, you don't see the error that you had before with lists.
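To check this on a whole column rather than a single cell, you can map type over it. The sketch below uses two small hand-written Series as stand-ins for the generated and the reloaded col1 (it does not actually go through Excel), and then shows one possible way, assuming the strings are plain Python literals, to turn them back into lists with ast.literal_eval.

```python
import ast

import pandas as pd

# Stand-ins for col1 before saving and after loading (assumed values, no Excel round trip here).
generated = pd.Series([[1], [6]], name='col1')
loaded = pd.Series(['[1]', '[6]'], name='col1')

# Both Series report dtype object, but the element types differ.
print(generated.dtype, generated.map(type).unique())
print(loaded.dtype, loaded.map(type).unique())

# Parse the string representation back into real lists.
restored = loaded.map(ast.literal_eval)
print(restored.map(type).unique())
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for text read from a file.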
Upvotes: 1