geolguy
geolguy

Reputation: 11

Generating a list with a list comprehension using partial string matches between two lists

I have a list of approx. 150 mineral names that don't quite match their equivalents in an approved list of several thousand mineral names; some of the mineral names in my list differ in some way from their approved equivalents (e.g. I may have an entry 'Amphibole(Barroisite)' rather than the accepted 'Barroisite').

I need a list that comprises the ~150 approved equivalent mineral names. I think the way to go about this is to use a list comprehension to generate a new list from partial matches between entries in the two lists but I just can't get anything to work. I have previously checked the likes of Partial String match between two lists in python but have had no luck.

Examples of entries from my list and the approved list below:

approved_list = ['Aegirine','Barroisite','Cuprite','Pyrope','Rosasite','Traskite','Vaesite']

my_list = ['Pyroxene(Aegirine)','Amphibole(Barroisite)','Cuprite','Garnet(Pyrope)', 'Rosasite']

In the above example I would ideally generate a list comprising Aegirine, Barroisite, Cuprite, Pyrope, and Rosasite. The solution would also need to be flexible (e.g. cant rely on position of brackets) as there are a number of differences between some entries.

Thanks in advance for any ideas!

Upvotes: 0

Views: 102

Answers (1)

Grismar
Grismar

Reputation: 31379

It's hard to provide a complete answer with vague requirements. You'd have to specify more clearly what variations are possible.

But here is an example that ignores capitalisation, extra/missing diacritics (like umlaut - assuming the characters would be the same without diacritics, i.e. ä -> a and not ä -> ae), and whitespace:

import unicodedata


def strip_diacritics(s):
    return ''.join(
        # break down into characters after normalising:
        c for c in unicodedata.normalize('NFD', s)  
        # not a non-spacing mark:
        if unicodedata.category(c) != 'Mn'  
    )


approved_list = ['Aegirine', 'Barroisite', 'Cuprite', 'Pyrope', 'Rosasite', 'Traskite', 'Vaesite']

my_list = ['Pyroxene(Aegirine)', 'Amphibole(Barroïsite)', 'cuprite', 'Garnet (Pyrope)', 'Rosasite ']

# create a quick lookup from normalised name to desired name
approved_dict = {strip_diacritics(name).strip().lower(): name for name in approved_list}

new_list = [
    next(name for key, name in approved_dict.items()
         if key in strip_diacritics(test).strip().lower())
    for test in my_list
]

print(new_list)

Note how I introduced some problems into my_list and how that doesn't affect the outcome. Output:

['Aegirine', 'Barroisite', 'Cuprite', 'Pyrope', 'Rosasite']

Upvotes: 0

Related Questions