Reputation: 1
I have a list of plant names from an Excel spreadsheet that I extracted with pandas. After removing duplicates and lower-casing the entire list, I want to drop any name that contains characters such as parentheses, apostrophes, or dashes, or words like "A" and "The", to further eliminate possible duplicates. For example, in a list like ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin") only "Pumpkin" would remain. Note that I don't want to strip just those characters from the string; I want to remove the entire string from the list.
import re

def checkSyntax(str):
    boolean = None
    regexes = ["a ", "the ", r"^\W"]
    combined = "(" + ")|(".join(regexes) + ")"
    match = re.match(combined, str)
    if match == None:
        boolean = True
    return boolean

def elimInvalidNames(names):
    new_names = [s for s in names if checkSyntax(s)]
    return new_names

test_list = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")
test_list = list(map(lambda x: x.lower(), test_list))
test_list = elimInvalidNames(test_list)
print(test_list)
For some reason this gets rid of names starting with "the" and "a", but not names containing parentheses, dashes, or apostrophes.
Upvotes: 0
Views: 77
Reputation: 36989
If you want to use regular expressions, use re.search, which attempts to match the expression anywhere in the string. The re.match function only attempts to match the expression at the start of the string.
For example, the following code filters the list to ['Pumpkin']:
import re

invalid_names_re = re.compile(r"(A )|(The )|[()\-']", re.IGNORECASE)
names = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")
filtered_names = [name for name in names if not invalid_names_re.search(name)]
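
To make the match/search difference concrete, here is a minimal sketch. The helper name is_clean and the extra sample "Banana Tree" are my own assumptions, not part of the question. Note that a pattern such as (A ) used with search and IGNORECASE also matches the "a " inside "Banana Tree", so this sketch uses \b word boundaries to match the articles only as whole words.

```python
import re

names = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin",
         "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin",
         "Banana Tree")  # extra sample, not in the question

# re.match only tests at the start of the string; re.search scans all of it.
print(re.match(r"-", "Pump-kin"))         # None
print(bool(re.search(r"-", "Pump-kin")))  # True

# \b keeps "a" and "the" from matching inside longer words.
invalid_re = re.compile(r"\b(?:a|the)\b|[()\-']", re.IGNORECASE)

def is_clean(name):
    """Hypothetical helper: True if the name has no article or listed punctuation."""
    return invalid_re.search(name) is None

filtered = [name for name in names if is_clean(name)]
print(filtered)  # ['Pumpkin', 'Banana Tree']
```

Whether "Banana Tree" should survive depends on your data; if it should not, drop the word boundaries.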
Upvotes: 0
Reputation: 34
If using a regex is not mandatory, try this:
names = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")

def checkSyntax(s):
    regexes = ["a ", "the ", '"', "'", '\\', '.', ',', ')', '(', '-']
    return not any(letra in s.lower() for letra in regexes)

def eIn(names):
    new_names = [s for s in names if checkSyntax(s)]
    return new_names
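
A possible variation on the same idea, sketched under my own assumptions (the names FORBIDDEN_CHARS, ARTICLES, and is_plain_name are hypothetical): use set intersection for the forbidden characters and compare the articles against whole words, so that a name which merely contains "a " inside a word, such as "Banana Tree", is not rejected.

```python
FORBIDDEN_CHARS = set("\"'\\.,()-")  # characters that disqualify a name
ARTICLES = {"a", "the"}              # whole words that disqualify a name

def is_plain_name(name):
    """Hypothetical helper: True if the name has no forbidden character or article."""
    lowered = name.lower()
    if FORBIDDEN_CHARS & set(lowered):
        return False
    # split() yields whole words, so "banana" is never confused with "a"
    return not (ARTICLES & set(lowered.split()))

names = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin",
         "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")
kept = [name for name in names if is_plain_name(name)]
print(kept)  # ['Pumpkin']
```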
Upvotes: 0
Reputation: 27547
This should do it:
import re

new = []
test_list = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin", "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin")
for s in test_list:
    for n in s.split():
        # Add the word to the new list if it is not 'a' or 'the' and contains no punctuation
        if n == re.sub(r'[^\w\s]', '', n) and n.lower() != 'a' and n.lower() != 'the':
            new.append(n)
print(list(set(new)))  # Convert to a set to remove duplicates, then back to a list
Output:
['Pumpkin']
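
Note that the loop above collects individual words, so a multi-word name would come back as separate words. If whole names should be kept or dropped as a unit, a sketch along these lines might work. The helper name is_valid and the two extra samples are my own assumptions, not from the question.

```python
import re

test_list = ("A Pumpkin", "Pumpkin", "The Pumpkin", "Pump-kin",
             "(European) Pumpkin", "Pumpkin (Orange)", "Farmer's Pumpkin",
             "pumpkin", "Sweet Potato")  # last two are extra samples

def is_valid(name):
    """Hypothetical helper: every word is purely word characters and not an article."""
    return all(
        re.fullmatch(r"\w+", word) and word.lower() not in ("a", "the")
        for word in name.split()
    )

# dict.fromkeys drops duplicates while preserving first-seen order
unique = list(dict.fromkeys(name.lower() for name in test_list if is_valid(name)))
print(unique)  # ['pumpkin', 'sweet potato']
```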
Upvotes: 1