Reputation: 463
I am trying to find the closest match to an approximate movie title in a list of actual movie titles, using the max function and its key argument. If I define a sample list and test it, it works:
from difflib import SequenceMatcher as SM
movies = ['fake movie title', 'faker movie title', 'shaun died']
approx_title = 'Shaun of the Dead.'
max(movies, key = lambda title: SM(None, approx_title, title).ratio())
'shaun died'
But I want to match against an entire column of a separate DataFrame, so I converted that pandas Series to a list and ran the same expression. This time I get a TypeError, even though I've checked that both movies and movie_lst are lists.
Old id   New id       Title                                 Year    Critics Score  Audience Score  Rating
NaN      21736.0      Peter Pan                             1999.0  NaN            70.0            PG nothing objectionable
NaN      771471359.0  Dragonheart Battle for the Heartfire  2017.0  NaN            50.0            PG13
NaN      770725090.0  The Nude Vampire Vampire nue, La      1974.0  NaN            24.0            NR
2281.0   19887.0      Beyond the Clouds                     1995.0  65.0           67.0            NR
10913.0  11286.0      Wild America                          1997.0  27.0           59.0            PG violence
movie_lst = rt_info['Title'].tolist()
['Peter Pan',
'Dragonheart Battle for the Heartfire',
'The Nude Vampire Vampire nue, La',
'Beyond the Clouds',
'Wild America',
'Sexual Dependency',
'Body Slam',
'Hatchet II',
'Lion of the Desert Omar Mukhtar',
'Imagine That',
'Harold',
'A United Kingdom',
'Violent City The FamilyCitt violenta',
'Ratchet Clank',
'Wes Craven Presents Carnival of Souls',
'The Adventures of Ociee Nash',
'Blackfish',
'For Petes Sake',
'Daybreakers',
'The Big One',
'Godzilla vs Megaguirus',
'In a Lonely Place',
'Case 39', ...
]
max(movie_lst, key = lambda title: SM(None, approx_title, title).ratio())
TypeError Traceback (most recent call last)
<ipython-input-88-0022a3c1bdb9> in <module>()
----> 1 max(movie_lst, key = lambda title: SM(None, approx_title, title).ratio())
<ipython-input-88-0022a3c1bdb9> in <lambda>(title)
----> 1 max(movie_lst, key = lambda title: SM(None, approx_title, title).ratio())
/usr/lib/python3.4/difflib.py in __init__(self, isjunk, a, b, autojunk)
211 self.a = self.b = None
212 self.autojunk = autojunk
--> 213 self.set_seqs(a, b)
214
215 def set_seqs(self, a, b):
/usr/lib/python3.4/difflib.py in set_seqs(self, a, b)
223
224 self.set_seq1(a)
--> 225 self.set_seq2(b)
226
227 def set_seq1(self, a):
/usr/lib/python3.4/difflib.py in set_seq2(self, b)
277 self.matching_blocks = self.opcodes = None
278 self.fullbcount = None
--> 279 self.__chain_b()
280
281 # For each element x in b, set b2j[x] to a list of the indices in
/usr/lib/python3.4/difflib.py in __chain_b(self)
309 self.b2j = b2j = {}
310
--> 311 for i, elt in enumerate(b):
312 indices = b2j.setdefault(elt, [])
313 indices.append(i)
TypeError: 'float' object is not iterable
I'm stumped as to why - any help would be appreciated!
Upvotes: 1
Views: 3200
Reputation: 140276
Not a pandas expert and I cannot reproduce this, but depending on how the file is read, some titles (like the French movie 11.6, for instance) look like floats, so it's possible that some values end up as floats instead of strings (well, your issue proves that it is possible :))
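To illustrate, here is a minimal sketch of how a single float in the list produces the same error. The sample list is hypothetical; it assumes one title is missing and stored as NaN (which is a float) and is not taken from your data.

```python
from difflib import SequenceMatcher as SM

approx_title = 'Shaun of the Dead.'

# Hypothetical sample: one entry is a float (NaN), standing in for a missing title.
movie_lst = ['Peter Pan', float('nan'), 'Wild America']

try:
    max(movie_lst, key=lambda title: SM(None, approx_title, title).ratio())
    error_message = None
except TypeError as exc:
    # SequenceMatcher tries to iterate over its second sequence and fails on the float.
    error_message = str(exc)

print(error_message)
```

The list itself is still a list, which is why checking the type of movie_lst does not reveal the problem; it is the type of the individual elements that matters.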
A good workaround would be to force data as string like this:
movie_lst = [str(x) for x in movie_lst]
It doesn't create copies of the values that are already strings (see "Should I avoid converting to a string if a value is already a string?"), so it's efficient, and you are sure to get only strings.
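A quick sketch of what str() does to each kind of value, assuming CPython and exact str instances for the identity check. Note the caveat in the last line: a NaN becomes the literal text 'nan', which then takes part in the matching like any other title.

```python
title = 'Peter Pan'

# For an exact str instance, CPython returns the very same object (no copy).
same_object = str(title) is title

# A float title is converted to its text form.
converted_float = str(11.6)

# A missing value (NaN) becomes the string 'nan'.
converted_nan = str(float('nan'))

print(same_object, converted_float, converted_nan)
```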
Note that you can find the offenders by printing:
[x for x in movie_lst if not isinstance(x,str)]
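If you would rather drop the non-string entries than convert them, something along these lines may work. This is only a sketch under assumptions: the sample list and the names offenders and titles are made up for illustration.

```python
from difflib import SequenceMatcher as SM

approx_title = 'Shaun of the Dead.'

# Hypothetical sample mixing real strings with a NaN and a float-like title.
movie_lst = ['Shaun of the Dead', float('nan'), 11.6, 'Wild America']

# Entries that would make SequenceMatcher fail.
offenders = [x for x in movie_lst if not isinstance(x, str)]

# Keep only genuine strings before matching.
titles = [x for x in movie_lst if isinstance(x, str)]

best = max(titles, key=lambda title: SM(None, approx_title, title).ratio())
print(offenders, best)
```

Be aware that filtering also discards a numeric-looking title such as 11.6, so inspect the offenders first. On the pandas side, rt_info['Title'].dropna().astype(str).tolist() should drop the missing values and convert the rest to strings, though I have not run it against your data.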
Upvotes: 2