wsdzbm

Reputation: 3670

String comparison in Python

I'm looking for duplicate files by comparing the filenames.

However, I found that some paths returned by os.walk contain escaped characters. For example, I may get structure in the Earth\'s core.pdf for one file and structure in the Earth\xe2\x80\x99s core.pdf for another.

In [1]: print 'structure in the Earth\'s core.pdf\nstructure in the Earth\xe2\x80\x99s core.pdf'
structure in the Earth's core.pdf
structure in the Earth’s core.pdf

In [2]: 'structure in the Earth\'s core.pdf' == 'structure in the Earth\xe2\x80\x99s core.pdf'
Out[2]: False

How do I deal with these cases?

==== Just to clarify the Q in response to the comments, there are also other situations for duplicate files like

Upvotes: 1

Views: 214

Answers (1)

Arthur Gouveia

Reputation: 744

Maybe you can measure the similarity of the strings instead of requiring an exact match. Requiring an exact match can be problematic because of simple things like capitalization.

I suggest the following:

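from difflib import SequenceMatcher

# The two filenames from the question: an ASCII apostrophe vs. the
# UTF-8 bytes of a right single quotation mark (U+2019).
s1 = "structure in the Earth's core.pdf"
s2 = "structure in the Earth\xe2\x80\x99s core.pdf"

# The first argument is the optional "junk" filter; None disables it.
matcher = SequenceMatcher(None, s1, s2)
print(matcher.ratio())
# 0.9411764705882353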

This result shows that the similarity between the two strings is over 94%. You could define a threshold above which two filenames count as likely duplicates, and then review those items before deleting anything.
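As a rough sketch of the threshold idea, something like the following could compare every pair of filenames and keep only the pairs that score high enough. The helper name similar_pairs, the 0.9 threshold, and the sample filenames are all assumptions made for illustration, not part of the question. Lowercasing before the comparison is one way to keep capitalization from lowering the score.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.9):
    """Return (name_a, name_b, ratio) for every pair at or above threshold.

    Hypothetical helper; the default threshold is an arbitrary choice.
    """
    pairs = []
    for a, b in combinations(names, 2):
        # Lowercase first so capitalization alone does not lower the score.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs


# Made-up filenames for illustration only.
names = [
    "structure in the Earth's core.pdf",
    "structure in the Earth\xe2\x80\x99s core.pdf",
    "Structure in the Earth's Core.pdf",
    "notes.txt",
]

for a, b, ratio in similar_pairs(names):
    print("%.3f  %r  %r" % (ratio, a, b))
```

Note that this compares every pair, so the work grows quadratically with the number of files. For a large tree it may be worth comparing only files that already share a size or extension.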

Upvotes: 1
