Reputation: 3670
I'm looking for duplicate files by comparing their filenames.
However, I found that some paths returned by os.walk
contain escaped characters. For example, I may get structure in the Earth\'s core.pdf
for one file and structure in the Earth\xe2\x80\x99s core.pdf
for another.
In [1]: print 'structure in the Earth\'s core.pdf\nstructure in the Earth\xe2\x80\x99s core.pdf'
structure in the Earth's core.pdf
structure in the Earth’s core.pdf
In [2]: 'structure in the Earth\'s core.pdf' == 'structure in the Earth\xe2\x80\x99s core.pdf'
Out[2]: False
How do I deal with these cases?
====
Just to clarify the question in response to the comments: there are also other kinds of near-duplicate filenames, for example one name containing - while the other contains :
Upvotes: 1
Views: 214
Reputation: 744
Maybe you can measure the similarity of the strings instead of requiring an exact match. An exact match can be problematic because of simple things like capitalization.
I suggest the following:
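from difflib import SequenceMatcher
# The two filenames from the question: a plain apostrophe versus
# the escaped bytes of the right single quotation mark
s1 = "structure in the Earth's core.pdf"
s2 = "structure in the Earth\xe2\x80\x99s core.pdf"
# None as the first argument means no characters are treated as junk
matcher = SequenceMatcher(None, s1, s2)
# ratio() returns a similarity score between 0.0 and 1.0
print(matcher.ratio())
# 0.9411764705882353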
This result shows that the similarity between both strings is over 94%. You could define a threshold to delete or to review the items before deletion.
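As a rough sketch of the threshold idea, the snippet below compares every pair of names and keeps those whose ratio is at or above a cutoff. The helper name similar_pairs, the 0.9 default, and the sample list are assumptions made for illustration, not part of the question. It also assumes Python 3, where os.walk returns str, so the curly apostrophe is a single character (U+2019). Names are lowercased before comparison so that capitalization alone does not lower the score.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.9):
    """Return (name_a, name_b, ratio) for every pair at or above threshold.

    Hypothetical helper: the name and the default threshold are
    illustrative choices, not taken from the original post.
    """
    pairs = []
    for a, b in combinations(names, 2):
        # Lowercase both names so capitalization alone does not lower the score.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs


# Sample names (assumed); in practice they would be collected from os.walk.
names = [
    "structure in the Earth's core.pdf",
    "structure in the Earth\u2019s core.pdf",
    "Structure in the Earth's Core.pdf",
    "unrelated notes.txt",
]

for a, b, ratio in similar_pairs(names):
    print(f"{ratio:.2f}  {a!r}  ~  {b!r}")
```

Comparing all pairs is quadratic in the number of files, so for a large tree it may be worth reviewing the candidates in batches (per directory, for instance) rather than deleting anything automatically.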
Upvotes: 1