MLFA

Reputation: 21

Selectively Delete White Space After String in Python

I have the following snippet of code:

    import pandas as pd

    df = pd.DataFrame([{'LastName':'VAN HOUTEN'},
                       {'LastName':"O'BOYLE"},
                       {'LastName':'ESTEVAN-GONZALEZ'},
                       {'LastName':'RODRIGO TEIXEIRA'},
                       {'LastName':'ESTEBAN GONZALEZ'}, 
                       {'LastName':'O ROURKE'},
                       {'LastName':'RODRIGO-TEIXEIRA'}])

    delete_space_after_list = ['VAN','O']

    df['NewName'] = df['LastName'].str.replace("'"," ")

    for s in delete_space_after_list[:]:
        df['NewName'] = df['NewName'].str.replace(s + ' ', s)

    df['NewName'] = df['NewName'].str.replace('-'," ")
    df['NewName'] = df['NewName'].str.split().str.get(0)   

Running this code gives me the following result:

    Index        LastName               NewName
    0            VAN HOUTEN             VANHOUTEN
    1            O'BOYLE                OBOYLE
    2            ESTEVAN-GONZALEZ       ESTEVAN
    3            RODRIGO TEIXEIRA       RODRIGOTEIXEIRA
    4            ESTEVAN GONZALEZ       ESTEVANGONZALEZ
    5            O ROURKE               OROURKE
    6            RODRIGO-TEIXEIRA       RODRIGO

However, the desired output is this:

    Index        LastName               DesiredName
    0            VAN HOUTEN             VANHOUTEN
    1            O'BOYLE                OBOYLE
    2            ESTEVAN-GONZALEZ       ESTEVAN
    3            RODRIGO TEIXEIRA       RODRIGO
    4            ESTEVAN GONZALEZ       ESTEVAN
    5            O ROURKE               OROURKE
    6            RODRIGO-TEIXEIRA       RODRIGO

The code removes the space after RODRIGO (because the first word of LastName ends in 'O') and joins it to 'TEIXEIRA'. Similarly, it removes the space after ESTEVAN (because that word ends in 'VAN') and joins it to 'GONZALEZ'. The space in the other names is removed correctly.
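A minimal sketch of the same behaviour in plain Python, without pandas, may make the cause easier to see: a literal replace of `'O '` or `'VAN '` matches anywhere in the string, not only where the prefix is a whole word at the start. The helper name `naive_join` is made up for this illustration.

```python
# Prefixes after which the space should be removed (from the snippet above).
delete_space_after_list = ['VAN', 'O']

def naive_join(name):
    # Hypothetical helper: the same literal replacement the loop performs.
    # 'O ' also matches the tail of 'RODRIGO ', and 'VAN ' the tail of 'ESTEVAN '.
    for s in delete_space_after_list:
        name = name.replace(s + ' ', s)
    return name

names = ['VAN HOUTEN', 'RODRIGO TEIXEIRA', 'ESTEVAN GONZALEZ', 'O ROURKE']
results = {n: naive_join(n) for n in names}

for original, joined in results.items():
    print(original, '->', joined)
```

Here `'RODRIGO TEIXEIRA'` and `'ESTEVAN GONZALEZ'` lose their space even though neither starts with a listed prefix, so the later `split()` has nothing left to split on.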

How can I get this code to correctly delete the white space as it does for VAN HOUTEN, O'BOYLE, ESTEVAN-GONZALEZ, O ROURKE, & RODRIGO-TEIXEIRA while not deleting the white space after ESTEVAN GONZALEZ & RODRIGO TEIXEIRA?

Upvotes: 2

Views: 96

Answers (2)

cs95

Reputation: 402533

Here is a pandas solution. The regex isn't as clean as Jean-François Fabre's, but it works.

In [541]: import operator

In [542]:  df['LastName'].transform(lambda x: x.replace("[-']", ' ', regex=True) \
     ...:                                     .replace(r'^((?:O)|(?:VAN)) ', r'\1', regex=True) \
     ...:                                     .str.split()) \
     ...:                .map(operator.itemgetter(0))
     ...: 
Out[546]: 
0    VANHOUTEN
1       OBOYLE
2      ESTEVAN
3      RODRIGO
4      ESTEBAN
5      OROURKE
6      RODRIGO
Name: LastName, dtype: object
  1. replace("[-']", ' ', regex=True) converts all hyphens and apostrophes to spaces.

  2. replace(r'^((?:O)|(?:VAN)) ', r'\1', regex=True) removes the space after a starting 'O' or 'VAN'.

  3. str.split() splits on whitespace, and operator.itemgetter(0) then keeps the first piece.
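The three steps above can also be sketched on plain strings with the standard `re` module, which makes each intermediate value easy to inspect. The helper name `first_name_token` is made up for illustration. Building the alternation from the question's `delete_space_after_list` rather than hard-coding it is an assumption, not part of the answer above. The sketch also assumes every name is non-empty.

```python
import re

# Prefix list taken from the question.
delete_space_after_list = ['VAN', 'O']

# Build the alternation "VAN|O"; re.escape guards against regex
# metacharacters should a prefix ever contain one.
prefix_alternation = '|'.join(re.escape(p) for p in delete_space_after_list)

def first_name_token(last_name):
    # Hypothetical helper: plain-string version of the three steps.
    # Step 1: hyphens and apostrophes become spaces.
    spaced = re.sub(r"[-']", ' ', last_name)
    # Step 2: drop the space only after a prefix at the very start (^).
    joined = re.sub(rf'^({prefix_alternation}) ', r'\1', spaced)
    # Step 3: split on whitespace and keep the first piece.
    return joined.split()[0]

names = ['VAN HOUTEN', "O'BOYLE", 'ESTEVAN-GONZALEZ', 'RODRIGO TEIXEIRA',
         'ESTEBAN GONZALEZ', 'O ROURKE', 'RODRIGO-TEIXEIRA']
result = [first_name_token(n) for n in names]
print(result)
```

The `^` anchor is what stops `'O '` from matching the end of `RODRIGO`: the prefix has to be the whole first word.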

Upvotes: 1

Jean-François Fabre

Reputation: 140188

So you want to drop the "less significant" name, meaning the one that follows a name ending in O or VAN (but not a name that is exactly O or VAN), and also strip the non-letters from the other names.

That's a job for regular expressions (or a long, painful job without them).

I would do that by chaining 2 regular expressions like this (I left pandas out of this, as the problem has no direct relation with pandas):

data = [{'LastName': 'VAN HOUTEN'},
        {'LastName': "O'BOYLE"},
        {'LastName': 'ESTEVAN-GONZALEZ'},
        {'LastName': 'RODRIGO TEIXEIRA'},
        {'LastName': 'ESTEVAN GONZALEZ'},  # not ESTEBAN as in your example!
        {'LastName': 'O ROURKE'},
        {'LastName': 'RODRIGO-TEIXEIRA'}]

import re

# Raw strings keep \W and \1 from being treated as string escapes.
new_data = [re.sub(r"\W", "", re.sub(r"(.)(O|VAN)\W.*", r"\1\2", v['LastName'])) for v in data]

print(new_data)

result:

['VANHOUTEN', 'OBOYLE', 'ESTEVAN', 'RODRIGO', 'ESTEVAN', 'OROURKE', 'RODRIGO']

so:

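  • "(.)(O|VAN)\W.*" matches any one character followed by O or VAN, then a non-word character (\W), then the rest of the string. Only the first two groups are kept, so everything after the O or VAN is dropped. Because a preceding character is required, a name that is exactly O or VAN is left alone. That handles the "less significant" names.
  • "\W" deletes spaces, dashes, quotes and every other non-word character (anything that is not a letter, digit, or underscore). That handles the second case.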
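To see what each of the two passes contributes, here is a small sketch that keeps the intermediate string instead of nesting the calls. The function name `two_pass` is made up for this illustration; the patterns are the same two as above.

```python
import re

def two_pass(name):
    # Hypothetical helper returning (after pass 1, after pass 2).
    # Pass 1: if a word ends in O or VAN and has at least one character
    # before that ending, cut everything from the following non-word
    # character onwards.
    trimmed = re.sub(r"(.)(O|VAN)\W.*", r"\1\2", name)
    # Pass 2: delete whatever non-word characters remain.
    cleaned = re.sub(r"\W", "", trimmed)
    return trimmed, cleaned

names = ['VAN HOUTEN', "O'BOYLE", 'ESTEVAN-GONZALEZ', 'RODRIGO TEIXEIRA',
         'ESTEVAN GONZALEZ', 'O ROURKE', 'RODRIGO-TEIXEIRA']
steps = {name: two_pass(name) for name in names}

for name, (trimmed, cleaned) in steps.items():
    print(f"{name!r:20} -> {trimmed!r:14} -> {cleaned!r}")
```

Note that the first pass only fires when the first word ends in O or VAN. A name such as 'ESTEBAN GONZALEZ' (the spelling in the question's code) is not trimmed by it and comes out as 'ESTEBANGONZALEZ'.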

Upvotes: 2
