Reputation: 666
To clean a dataset, I need to split a string after its last digit. Any ideas?
My dataframe:
import pandas as pd

data = {'addr': [
    "510 -1, Cleveland St",
    "RC-20-5345 Poplar Street",
    "3600 Race Avenue Richardson"]}
df = pd.DataFrame(data)
addr
_____________________________________
510 -1, Cleveland St
RC-20-5345 Poplar Street
3600 Race Avenue Richardson
I tried this expression, but it misses the floor number prefix (RC) in the second row.
df["split1"] = df["addr"].str.extract(r"(\d+[-\ ]+\d*)")
split1 | split2
___________|_________________________
510 -1 | , Cleveland St
20-5345 | Poplar Street
3600 | Race Avenue Richardson
What I'm looking for:
split1 | split2
___________|_________________________
510 -1 | , Cleveland St
RC-20-5345 | Poplar Street
3600 | Race Avenue Richardson
Upvotes: 1
Views: 286
Reputation: 165
def splitByLastDigit(x):
    lastDigit=0
    splitOne=""
    splitTwo=""
    finalArray=[]
    for i in range(0,len(x)):
        if x[i].isdigit() and i > lastDigit:
            lastDigit=i
    for i in range(0,len(x)):
        if i <= lastDigit:
            splitOne+=x[i]
        else:
            splitTwo+=x[i]
    if len(splitTwo.strip()) == 1 and splitTwo.strip().isalpha():
        return [splitOne+splitTwo]
    finalArray.append(splitOne)
    finalArray.append(splitTwo)
    return finalArray
I just wrote up this solution. It is a bit rough (it can definitely be done more elegantly), but I tested it with the three examples you provided and it gets the job done.
The idea is pretty simple. The first loop finds the index of the last digit, then a second loop sorts each character into the part before or after that index. Finally, both parts are appended to a list, which is returned.
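The same idea (find the index of the last digit, then slice on either side of it) can also be sketched more compactly. This is only an illustration, not the code above: `split_after_last_digit` is a hypothetical name, it returns a tuple instead of a list, and it leaves out the special case for a single trailing letter.

```python
def split_after_last_digit(text):
    """Split text right after its last digit (hypothetical helper)."""
    # Index of the last digit, or -1 when the string has no digit at all.
    last = max((i for i, ch in enumerate(text) if ch.isdigit()), default=-1)
    # Everything up to and including that digit, then the remainder.
    return text[:last + 1], text[last + 1:]


for addr in ["510 -1, Cleveland St",
             "RC-20-5345 Poplar Street",
             "3600 Race Avenue Richardson"]:
    print(split_after_last_digit(addr))
# ('510 -1', ', Cleveland St')
# ('RC-20-5345', ' Poplar Street')
# ('3600', ' Race Avenue Richardson')
```

A string without any digit comes back as an empty first part and the whole string as the second. On a DataFrame it could presumably be applied row by row with `df["addr"].apply(split_after_last_digit)`.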
Upvotes: 1
Reputation: 23281
To piggyback on zyd's answer, capture the remainder in a second group:
import pandas as pd

data = {'addr': [
    "510 -1, Cleveland St",
    "RC-20-5345 Poplar Street",
    "3600 Race Avenue Richardson"]}
df = pd.DataFrame(data)
df[['split1','split2']] = df["addr"].str.extract(r"(.*\d+[-\ ]+\d*)(.+)")
                          addr       split1                  split2
0         510 -1, Cleveland St       510 -1          , Cleveland St
1     RC-20-5345 Poplar Street  RC-20-5345            Poplar Street
2  3600 Race Avenue Richardson        3600   Race Avenue Richardson
Upvotes: 1
Reputation: 925
What about just adding a wildcard match to the front of the regex?
df["split1"] = df["addr"].str.extract(r"(.*\d+[-\ ]+\d*)")
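The pattern above still needs a space or hyphen right after a run of digits, so a value like "12B Main Road" would not match it. As a hedged alternative, a greedy `.*` followed by a single `\d` anchors the first group on the last digit of the string. The sketch below uses only the standard `re` module; `LAST_DIGIT` and `split_regex` are hypothetical names, and single-line strings are assumed.

```python
import re

# Greedy ".*" backtracks only as far as needed, so "\d" lands on the last digit.
LAST_DIGIT = re.compile(r"^(.*\d)(.*)$")


def split_regex(text):
    """Return (up to the last digit, remainder); hypothetical helper."""
    match = LAST_DIGIT.match(text)
    if match is None:
        # No digit anywhere: keep the whole string as the remainder.
        return "", text
    return match.group(1), match.group(2)


print(split_regex("RC-20-5345 Poplar Street"))
# ('RC-20-5345', ' Poplar Street')
print(split_regex("12B Main Road"))
# ('12', 'B Main Road')
```

The same pattern should also work in pandas as `df["addr"].str.extract(r"^(.*\d)(.*)$")`, which returns one column per capture group.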
Upvotes: 2