Reputation: 664
I have a list of paths that look like this (see below). As you can see, the file naming is inconsistent, but I would like to keep only one file per person. I already have a function that removes duplicates if they have the exact same file name but different file extensions; however, this inconsistent naming makes the problem trickier.
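For reference, the extension-based deduplication I already have looks roughly like this (the function name is just illustrative):

```python
import os

def dedupe_extensions(paths):
    """Keep the first path seen for each base name, ignoring the extension."""
    seen = set()
    unique = []
    for path in paths:
        stem, _ext = os.path.splitext(path)  # 'cv_bob.pdf' -> ('cv_bob', '.pdf')
        if stem not in seen:
            seen.add(stem)
            unique.append(path)
    return unique

print(dedupe_extensions(['cv_bob.pdf', 'cv_bob.docx', 'cv_lara.pdf']))
```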
The list of files looks something like this (but assume there are thousands of paths, plus words that aren't part of the full names, e.g. cv, curriculum_vitae, etc.):
all_files =
['cv_bob_johnson.pdf',
'bob_johnson_cv.pdf',
'curriculum_vitae_bob_johnson.pdf',
'cv_lara_kroft_cv.pdf',
'cv_lara_kroft.pdf' ]
Desired output:
unique_files = ['cv_bob_johnson.pdf', 'cv_lara_kroft.pdf']
Given that the names mostly follow a written pattern (e.g. the first name precedes the last name), I assume there must be a way to get a unique set of the paths when the names are repeated?
Upvotes: 0
Views: 42
Reputation: 856
If you want to keep your algorithm relatively simple (i.e. without resorting to ML, etc.), you'll need some idea of the typical substrings you want to remove. Let's make a list of such substrings, for example:
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
Then you can process your list of files this way:
import re

all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf', 'curriculum_vitae_bob_johnson.pdf', 'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']

unique = []
for file in all_files:
    # split off the extension, if any:
    try:
        name, suffix = file.rsplit('.', 1)
    except ValueError:  # no '.' in the file name
        name, suffix = file, None
    # remove the excess parts (the patterns contain no regex
    # metacharacters, so re.sub treats them as plain substrings):
    for rem in remove:
        name = re.sub(rem, '', name)
    # append the result to the list:
    unique.append(f'{name}.{suffix}' if suffix else name)

# remove duplicates:
unique = list(set(unique))
print(unique)
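A variant, if you'd rather keep one of the original file names per person instead of the stripped names: key a dict by the normalized name and keep the first path seen (the `normalize` helper here is just illustrative, and "first seen" may not be the exact spelling you'd prefer to keep):

```python
all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf',
             'curriculum_vitae_bob_johnson.pdf',
             'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']

def normalize(path):
    """Strip the extension and the known excess substrings."""
    name = path.rsplit('.', 1)[0]
    for rem in remove:
        name = name.replace(rem, '')
    return name

unique_files = {}
for file in all_files:
    # keep the first original path seen for each normalized name:
    unique_files.setdefault(normalize(file), file)

print(list(unique_files.values()))
```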
Upvotes: 1