Reputation: 11
I have a list of approx. 150 mineral names that don't quite match their equivalents in an approved list of several thousand mineral names; some of the mineral names in my list differ in some way from their approved equivalents (e.g. I may have an entry 'Amphibole(Barroisite)' rather than the accepted 'Barroisite').
I need a list that comprises the ~150 approved equivalent mineral names. I think the way to go about this is to use a list comprehension to generate a new list from partial matches between entries in the two lists, but I just can't get anything to work. I have already looked at similar questions, such as "Partial String match between two lists in python", but have had no luck.
Examples of entries from my list and the approved list below:
approved_list = ['Aegirine','Barroisite','Cuprite','Pyrope','Rosasite','Traskite','Vaesite']
my_list = ['Pyroxene(Aegirine)','Amphibole(Barroisite)','Cuprite','Garnet(Pyrope)', 'Rosasite']
In the above example I would ideally generate a list comprising Aegirine, Barroisite, Cuprite, Pyrope, and Rosasite. The solution would also need to be flexible (e.g. it can't rely on the position of brackets), as the entries differ from their approved equivalents in a number of ways.
Thanks in advance for any ideas!
Upvotes: 0
Views: 102
Reputation: 31379
It's hard to provide a complete answer with vague requirements. You'd have to specify more clearly what variations are possible.
But here is an example that ignores capitalisation, extra or missing diacritics (such as an umlaut, assuming the characters would be the same without diacritics, i.e. ä -> a and not ä -> ae), and leading or trailing whitespace:
import unicodedata


def strip_diacritics(s):
    return ''.join(
        # break down into characters after normalising:
        c for c in unicodedata.normalize('NFD', s)
        # not a non-spacing mark:
        if unicodedata.category(c) != 'Mn'
    )


approved_list = ['Aegirine', 'Barroisite', 'Cuprite', 'Pyrope', 'Rosasite', 'Traskite', 'Vaesite']
my_list = ['Pyroxene(Aegirine)', 'Amphibole(Barroïsite)', 'cuprite', 'Garnet (Pyrope)', 'Rosasite ']

# create a quick lookup from normalised name to desired name
approved_dict = {strip_diacritics(name).strip().lower(): name for name in approved_list}

# for each entry, take the first approved name whose normalised form occurs in it
new_list = [
    next(name for key, name in approved_dict.items()
         if key in strip_diacritics(test).strip().lower())
    for test in my_list
]
print(new_list)
Note how I introduced some problems into my_list and how that doesn't affect the outcome. Output:
['Aegirine', 'Barroisite', 'Cuprite', 'Pyrope', 'Rosasite']
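One thing to watch for with the comprehension above: next() without a default raises StopIteration as soon as an entry contains no approved name, and it returns the first hit even when several approved names occur in the same entry. A minimal sketch of a more forgiving variant follows, assuming that unmatched entries should map to None and that the longest matching approved name is the one wanted. The helper names normalise and match_approved, the made-up entry 'Unknownite', and the longest-match rule are my own assumptions for illustration, not part of the question's requirements.

```python
import unicodedata


def normalise(s):
    """Lower-case, trim, and drop combining marks (same idea as strip_diacritics)."""
    decomposed = unicodedata.normalize('NFD', s)
    stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    return stripped.strip().lower()


def match_approved(entry, approved):
    """Return the longest approved name found inside entry, or None if none is found."""
    target = normalise(entry)
    hits = [name for name in approved if normalise(name) in target]
    return max(hits, key=len) if hits else None


approved_list = ['Aegirine', 'Barroisite', 'Cuprite', 'Pyrope', 'Rosasite', 'Traskite', 'Vaesite']
# 'Unknownite' is a made-up entry that matches nothing in the approved list
my_list = ['Pyroxene(Aegirine)', 'Amphibole(Barroïsite)', 'cuprite', 'Garnet (Pyrope)', 'Unknownite']

# map every entry to its approved name (or None), then collect the failures
matches = {entry: match_approved(entry, approved_list) for entry in my_list}
unmatched = [entry for entry, name in matches.items() if name is None]

print(matches)
print(unmatched)
```

Keeping the unmatched entries in a separate list makes it easy to review by hand the names that need a different rule, rather than having the whole run fail on the first one.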
Upvotes: 0