Adil Blanco

Reputation: 666

How to split string after the last digit

To clean a dataset, I need to split a string after the last digit. Any ideas?

My dataframe:

import pandas as pd

data = {'addr':[
         "510 -1, Cleveland St", 
         "RC-20-5345 Poplar Street", 
         "3600 Race Avenue Richardson"]}

df = pd.DataFrame(data)

   addr
_____________________________________
   510 -1, Cleveland St
   RC-20-5345 Poplar Street
   3600 Race Avenue Richardson

I tried this expression, but it misses the floor number (RC) in the second row.

df["split1"] = df["addr"].str.extract(r"(\d+[-\ ]+\d*)")

  split1   | split2
___________|_________________________
510 -1     |  , Cleveland St
20-5345    |  Poplar Street
3600       |  Race Avenue Richardson

What I'm looking for:

  split1   | split2
___________|_________________________
510 -1     |  , Cleveland St
RC-20-5345 |  Poplar Street
3600       |  Race Avenue Richardson

Upvotes: 1

Views: 286

Answers (3)

HARRIBO

Reputation: 165

def splitByLastDigit(x):
    lastDigit = 0
    splitOne = ""
    splitTwo = ""
    # First pass: record the index of the last digit (stays 0 if there is none).
    for i in range(len(x)):
        if x[i].isdigit():
            lastDigit = i

    # Second pass: characters up to and including that index go to splitOne,
    # the rest go to splitTwo.
    for i in range(len(x)):
        if i <= lastDigit:
            splitOne += x[i]
        else:
            splitTwo += x[i]
    # A single trailing letter (e.g. the "B" in "12 B") stays with the first part.
    if len(splitTwo.strip()) == 1 and splitTwo.strip().isalpha():
        return [splitOne + splitTwo]
    return [splitOne, splitTwo]

I just wrote up this solution. It is a bit rough (it can definitely be made more elegant), but I tested it with the three examples you provided and it gets the job done.

The idea is pretty simple: it finds the index of the last digit, then separates the characters up to that index from the ones after it. Finally, both parts are returned in a list.
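The idea described above can also be sketched as a single scan from the right followed by slicing. This is only an illustration: `split_after_last_digit` is a hypothetical name, and returning an empty first part when the string has no digit is an assumption, not something the question specifies.

```python
def split_after_last_digit(text):
    """Split text right after its last digit (hypothetical helper)."""
    # Walk backwards so the first digit found is the last one in the string.
    for i in range(len(text) - 1, -1, -1):
        if text[i].isdigit():
            return text[:i + 1], text[i + 1:]
    # Assumption: with no digit at all, everything goes into the second part.
    return "", text


addresses = [
    "510 -1, Cleveland St",
    "RC-20-5345 Poplar Street",
    "3600 Race Avenue Richardson",
]
parts = [split_after_last_digit(a) for a in addresses]
for first, second in parts:
    print(repr(first), repr(second))
```

With pandas, something like `df["addr"].apply(split_after_last_digit)` should give one tuple per row, which would still need to be expanded into two columns.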

Upvotes: 1

cottontail

Reputation: 23281

To piggyback on zyd's answer, capture the remainder in a second group:

import pandas as pd

data = {'addr':[
         "510 -1, Cleveland St", 
         "RC-20-5345 Poplar Street", 
         "3600 Race Avenue Richardson"]}

df = pd.DataFrame(data)
df[['split1','split2']] = df["addr"].str.extract(r"(.*\d+[-\ ]+\d*)(.+)")

                          addr       split1                  split2
0         510 -1, Cleveland St       510 -1          , Cleveland St
1     RC-20-5345 Poplar Street  RC-20-5345            Poplar Street
2  3600 Race Avenue Richardson        3600   Race Avenue Richardson
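If the goal is literally to split after the last digit, a slightly simpler two-group pattern may also work. The sketch below uses the standard `re` module rather than pandas so that it runs on its own. The pattern `^(.*\d)\s*(.*)$` is an assumption for illustration, not taken from the answers here: the greedy `.*` runs up to the last digit, and `\s*` drops the whitespace between the two parts.

```python
import re

# Assumed pattern: group 1 ends at the last digit, group 2 is the remainder.
PATTERN = re.compile(r"^(.*\d)\s*(.*)$")

addresses = [
    "510 -1, Cleveland St",
    "RC-20-5345 Poplar Street",
    "3600 Race Avenue Richardson",
]

# Every sample address contains a digit, so match() never returns None here;
# real data without digits would need an explicit check.
rows = [PATTERN.match(a).groups() for a in addresses]
for split1, split2 in rows:
    print(f"{split1!r:14} {split2!r}")
```

The same pattern could presumably be passed to `str.extract`, which leaves NaN in rows that do not match.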

Upvotes: 1

zyd

Reputation: 925

What about just adding a wildcard match to the front of the regex?

df["split1"] = df["addr"].str.extract(r"(.*\d+[-\ ]+\d*)")
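To see why the leading wildcard helps, the two patterns can be compared outside pandas. This is a minimal sketch using the standard `re` module, and `first_group` is a hypothetical helper. Because `.*` is greedy, the match starts at the beginning of the string and extends to the last place where a digit is followed by a space or hyphen.

```python
import re

ORIGINAL = r"(\d+[-\ ]+\d*)"      # pattern from the question
PREFIXED = r"(.*\d+[-\ ]+\d*)"    # same pattern with a leading wildcard


def first_group(pattern, text):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


addr = "RC-20-5345 Poplar Street"
print(repr(first_group(ORIGINAL, addr)))  # '20-5345'
print(repr(first_group(PREFIXED, addr)))  # 'RC-20-5345 '
```

Note the trailing space in the second result; it can be stripped afterwards if it matters.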

Upvotes: 2
