olintyler

Reputation: 1

How to eliminate elements with certain characters or phrases from a list in python?

I have a list of plant names from an Excel spreadsheet that I extracted with pandas. After removing duplicates and lower-casing the entire list, I want to drop any name that contains characters like parentheses, apostrophes, or dashes, or that starts with an article like "A" or "The", to eliminate further possible duplicates. For example, in a list like ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin"), only "Pumpkin" should remain. Note that I don't want to remove just the characters from the string, but the entire string from the list.

def checkSyntax(str):
    boolean = None

    regexes = ["a ", "the ", "^\W"]
    combined = "(" + ")|(".join(regexes) + ")"
    match = re.match('combined', str)
    if match == None:
        boolean = True

    return boolean

def elimInvalidNames(names):
    new_names = [s for s in names if checkSyntax(s)]
    return new_names 

test_list = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")
test_list = list(map(lambda x: x.lower(), test_list))
test_list = elimInvalidNames(test_list)
print(test_list)

For some reason this gets rid of "the" and "a" but not parentheses, dashes, or apostrophes.

Upvotes: 0

Views: 77

Answers (3)

a'r

Reputation: 36989

If you want to use regular expressions, then use re.search as this will attempt to match the expression at any part of the string. The re.match function only attempts to match the expression at the start of the string.

For example, the following code filters the list to ['Pumpkin']:

import re
invalid_names_re = re.compile(r"(A )|(The )|[()\-']", re.IGNORECASE)
names = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")

filtered_names = [name for name in names if not invalid_names_re.search(name)]
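To make the difference between the two functions concrete, here is a minimal sketch. The pattern and the helper name is_clean are my own assumptions, not part of the question: the articles are anchored to the start of the name with ^ so that a name merely containing "a " in the middle (for example "Banana Tree") is not dropped. Note also that the question passes the literal string 'combined' to re.match instead of the variable combined, so its pattern never matches anything.

```python
import re

# Assumed pattern for illustration: an article at the start of the name,
# or any of the characters ( ) - ' anywhere in the name.
INVALID_NAME_RE = re.compile(r"^(?:a|the)\s|[()\-']", re.IGNORECASE)

def is_clean(name):
    """Return True when the name contains none of the unwanted pieces."""
    return INVALID_NAME_RE.search(name) is None

names = ["A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin",
         "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin"]

# re.match only tries position 0, so the parenthesis later in the string is missed.
print(INVALID_NAME_RE.match("Pumpkin (Orange)"))          # None
# re.search scans the whole string and finds it.
print(INVALID_NAME_RE.search("Pumpkin (Orange)") is None) # False

print([n for n in names if is_clean(n)])  # ['Pumpkin']
```

If you would rather drop any name containing "a " or "the " anywhere, remove the ^ anchor, but be aware that this also removes names such as "Banana Tree".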

Upvotes: 0

Alysson Bruno

Reputation: 34

If using a regex is not mandatory, try this:

names = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")
def checkSyntax(s):
    regexes = ["a ", "the ",'"', "'", '\\', '.', ',', ')', '(', '-']
    return not any(letra in s.lower() for letra in regexes)

def eIn(names):
    new_names = [s for s in names if checkSyntax(s)]
    return new_names
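As a rough sketch of the same regex-free idea, the variant below checks the articles only at the start of the name, since a plain substring test for "a " would also reject something like "Banana Tree". The names BANNED_CHARS, ARTICLES, and is_plain_name are hypothetical, and the set of banned characters is an assumption based on the list above.

```python
# Assumed set of unwanted characters: ( ) - ' " \ . ,
BANNED_CHARS = set("()-'\"\\.,")
# Assumed articles, checked only at the start of the name.
ARTICLES = ("a ", "the ")

def is_plain_name(name):
    """Return True if the name has no leading article and no banned character."""
    lowered = name.lower()
    if lowered.startswith(ARTICLES):  # str.startswith accepts a tuple of prefixes
        return False
    return not any(ch in BANNED_CHARS for ch in lowered)

names = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin",
         "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")

print([n for n in names if is_plain_name(n)])  # ['Pumpkin']
```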

Upvotes: 0

Red

Reputation: 27547

This should do it:

import re
new = []
test_list = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")
for s in test_list:
    for n in s.split():
        # Keep the word only if it contains no punctuation and is not 'a' or 'the'
        if n == re.sub(r'[^\w\s]', '', n) and n.lower() not in ('a', 'the'):
            new.append(n)
print(list(set(new)))  # Convert to a set to remove duplicates, then back to a list

Output:

['Pumpkin']

Upvotes: 1
