Reputation: 21
I have the following snippet of code:
import pandas as pd
df = pd.DataFrame([{'LastName':'VAN HOUTEN'},
{'LastName':"O'BOYLE"},
{'LastName':'ESTEVAN-GONZALEZ'},
{'LastName':'RODRIGO TEIXEIRA'},
{'LastName':'ESTEBAN GONZALEZ'},
{'LastName':'O ROURKE'},
{'LastName':'RODRIGO-TEIXEIRA'}])
delete_space_after_list = ['VAN','O']
df['NewName'] = df['LastName'].str.replace("'"," ")
for s in delete_space_after_list[:]:
    df['NewName'] = df['NewName'].str.replace(s + ' ', s)
df['NewName'] = df['NewName'].str.replace('-'," ")
df['NewName'] = df['NewName'].str.split().str.get(0)
Running this code gives me the following result:
Index  LastName          NewName
0      VAN HOUTEN        VANHOUTEN
1      O'BOYLE           OBOYLE
2      ESTEVAN-GONZALEZ  ESTEVAN
3      RODRIGO TEIXEIRA  RODRIGOTEIXEIRA
4      ESTEVAN GONZALEZ  ESTEVANGONZALEZ
5      O ROURKE          OROURKE
6      RODRIGO-TEIXEIRA  RODRIGO
However, the desired output is this:
Index  LastName          DesiredName
0      VAN HOUTEN        VANHOUTEN
1      O'BOYLE           OBOYLE
2      ESTEVAN-GONZALEZ  ESTEVAN
3      RODRIGO TEIXEIRA  RODRIGO
4      ESTEVAN GONZALEZ  ESTEVAN
5      O ROURKE          OROURKE
6      RODRIGO-TEIXEIRA  RODRIGO
It is eliminating the space after RODRIGO (because of the 'O' at the end of the first word) and concatenating it with 'TEIXEIRA', and similarly eliminating the space after ESTEVAN (because of the 'VAN' at the end of the first word) and concatenating it with 'GONZALEZ'. However, it correctly eliminates the space in the other names.
How can I get this code to correctly delete the white space as it does for VAN HOUTEN, O'BOYLE, ESTEVAN-GONZALEZ, O ROURKE, & RODRIGO-TEIXEIRA while not deleting the white space after ESTEVAN GONZALEZ & RODRIGO TEIXEIRA?
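The behaviour can be reproduced without pandas. The following is a minimal sketch, assuming plain Python strings in place of the DataFrame column; strip_space_after is a hypothetical helper name introduced only for this illustration. Substring replacement has no notion of where a word starts, so 'O ' also matches at the end of RODRIGO.

```python
delete_space_after_list = ['VAN', 'O']

def strip_space_after(name):
    # Same logic as the loop above, applied to a single plain string.
    # str.replace substitutes every occurrence of the substring,
    # wherever it appears in the name.
    for s in delete_space_after_list:
        name = name.replace(s + ' ', s)
    return name

print(strip_space_after('O ROURKE'))          # intended case
print(strip_space_after('RODRIGO TEIXEIRA'))  # unintended match on the trailing 'O '
```

The first call prints OROURKE and the second prints RODRIGOTEIXEIRA, which is the unwanted concatenation described above.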
Upvotes: 2
Views: 96
Reputation: 402533
A pandas solution; the regex isn't as clean as Jean-François Fabre's, but it works.
In [541]: import operator
In [542]: df['LastName'].transform(lambda x: x.replace("[-']", ' ', regex=True) \
...: .replace(r'^((?:O)|(?:VAN)) ', r'\1', regex=True) \
...: .str.split()) \
...: .map(operator.itemgetter(0))
...:
Out[546]:
0 VANHOUTEN
1 OBOYLE
2 ESTEVAN
3 RODRIGO
4 ESTEBAN
5 OROURKE
6 RODRIGO
Name: LastName, dtype: object
replace("[-']", ' ', regex=True) converts all hyphens and apostrophes to spaces.
replace(r'^((?:O)|(?:VAN)) ', r'\1', regex=True) removes the space after a starting 'O' or 'VAN'.
str.split() splits on whitespace.
map(operator.itemgetter(0)) keeps the first piece of each split.
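To make the intermediate results easier to inspect, the same steps can be written one at a time. This is a sketch rather than the code above: it swaps in Series.str.replace and Series.str.get for the transform/itemgetter chain, and the variable names (names, step1, step2, first_token) are made up for the illustration.

```python
import pandas as pd

names = pd.Series(["VAN HOUTEN", "O'BOYLE", "ESTEVAN-GONZALEZ",
                   "RODRIGO TEIXEIRA", "O ROURKE"])

# Step 1: hyphens and apostrophes become spaces.
step1 = names.str.replace(r"[-']", " ", regex=True)

# Step 2: drop the space after a leading O or VAN.
# The ^ anchor is what keeps RODRIGO and ESTEVAN from matching.
step2 = step1.str.replace(r"^(O|VAN) ", r"\1", regex=True)

# Step 3: split on whitespace and keep the first token.
first_token = step2.str.split().str.get(0)

print(first_token.tolist())
```

For these five names the printed list is ['VANHOUTEN', 'OBOYLE', 'ESTEVAN', 'RODRIGO', 'OROURKE'].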
Upvotes: 1
Reputation: 140188
So you want to remove the "less significant" name, defined as the one following a name that ends with O or VAN without being exactly O or VAN, and also remove the non-letters from the other names.
That's a job for regular expressions (or a long, painful job without them).
I would do that by chaining 2 regular expressions like this (I left pandas out of this, as the problem has no direct relation with pandas):
data = [{'LastName':'VAN HOUTEN'},
{'LastName':"O'BOYLE"},
{'LastName':'ESTEVAN-GONZALEZ'},
{'LastName':'RODRIGO TEIXEIRA'},
{'LastName':'ESTEVAN GONZALEZ'}, # not ESTEBAN as in your example!
{'LastName':'O ROURKE'},
{'LastName':'RODRIGO-TEIXEIRA'}]
import re
new_data = [re.sub(r"\W", "", re.sub(r"(.)(O|VAN)\W.*", r"\1\2", v['LastName'])) for v in data]
print(new_data)
result:
['VANHOUTEN', 'OBOYLE', 'ESTEVAN', 'RODRIGO', 'ESTEVAN', 'OROURKE', 'RODRIGO']
so:
"(.)(O|VAN)\W.*" matches any one character followed by O or VAN, then a non-word character (\W) and the rest of the string, which we drop (we keep only the first 2 groups). Because (.) needs a character before O or VAN, a name that starts with the bare prefix is left alone. That handles the "less significant" names.
"\W" deletes spaces, dashes, quotes... everything that is not alphanumeric or an underscore. That handles the second case.
Upvotes: 2
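If the prefix list is likely to grow, the pattern can be built from it instead of being hard-coded. The following is a sketch under that assumption: first_surname and its prefixes parameter are hypothetical names, and re.escape is there only in case a prefix ever contains a regex metacharacter.

```python
import re

def first_surname(name, prefixes=("O", "VAN")):
    # Build the (O|VAN) alternation from the prefix list.
    alternation = "|".join(re.escape(p) for p in prefixes)
    # Cut everything after a first word that merely ends with a prefix.
    truncated = re.sub(rf"(.)({alternation})\W.*", r"\1\2", name)
    # Remove the remaining non-word characters (spaces, dashes, quotes).
    return re.sub(r"\W", "", truncated)

for name in ["VAN HOUTEN", "O'BOYLE", "ESTEVAN-GONZALEZ", "RODRIGO TEIXEIRA"]:
    print(name, "->", first_surname(name))
```

One thing to keep in mind: the pattern only truncates when the first word ends with one of the prefixes. A made-up input such as SMITH JONES is not truncated and comes out as SMITHJONES, so this variant fits the sample data rather than arbitrary double surnames.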