Reputation: 6615
import pandas as pd
import re
df = pd.DataFrame({'fix_this_field':['dogstreet 1234, st, texas 57500', 'animal hospital of dallas, 233 medical ln '], 'needed solution':['1234, st, texas 57500', '233 medical ln']})
df  # the 'needed solution' column shows the desired output
I want to extract everything from the first number onward, including the number itself; see the 'needed solution' column in the DataFrame. For example, 'hospital2019 lane' would become '2019 lane'.
I have tried something along the lines of the code below, but I am stuck. Please let me know where I am going wrong.
x = 'hospital2019 lane'
r = re.compile("^([a-zA-Z]+)([0-9]+)")
m = r.match(x)
m.groups()
# Returns ('hospital', '2019'): the match stops at 2019, but I want '2019 lane'.
Upvotes: 3
Views: 67
Reputation: 3591
I found
df.fix_this_field.apply(lambda x: x[re.search(r"\d", x).start():])
and
df.fix_this_field.apply(lambda x: ''.join(re.split(r'(\d)', x, maxsplit=1)[1:]))
to be several times as fast as
df.fix_this_field.str.split(r'(\d)', n=1).str[1:].apply(''.join)
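One thing to watch with the re.search version: re.search returns None when a value contains no digit, so calling .start() on it raises AttributeError. A minimal sketch that guards against that case (the helper name from_first_digit is hypothetical, not part of the question):

```python
import re

def from_first_digit(text):
    """Return text from its first digit onward, or None if it has no digit."""
    match = re.search(r"\d", text)
    # re.search returns None when nothing matches, so check before slicing.
    return text[match.start():] if match else None

print(from_first_digit("hospital2019 lane"))  # 2019 lane
print(from_first_digit("no digits here"))     # None
```

It can be applied to the column the same way, e.g. df.fix_this_field.apply(from_first_digit).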
Upvotes: 0
Reputation: 3752
If you must use regex, below is an attempt:
(?:[a-zA-Z ,])([0-9]+.*)
reg = re.compile(r'(?:[a-zA-Z ,])([0-9]+.*)')
def clean(col):
    # Return the first captured match, or None when the pattern does not match
    matches = reg.findall(col)
    return matches[0] if matches else None
df.fix_this_field.apply(clean)
Out[1]:
0 1234, st, texas 57500
1 233 medical ln
Name: fix_this_field, dtype: object
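Note that the pattern above requires a letter, space, or comma immediately before the digits, so a value that begins with a digit would match later in the string or not at all. As a possible alternative, here is a sketch using pandas' Series.str.extract with the simpler pattern (\d.*). The extra sample values and the .str.strip() call are illustrative assumptions, not part of the question:

```python
import pandas as pd

s = pd.Series([
    "dogstreet 1234, st, texas 57500",
    "animal hospital of dallas, 233 medical ln ",
    "1234 main st",      # starts with a digit
    "no digits here",    # no match, so the result is NaN
])

# (\d.*) captures from the first digit to the end of the string.
# expand=False returns a Series rather than a one-column DataFrame.
result = s.str.extract(r"(\d.*)", expand=False).str.strip()
print(result.tolist())
```

Rows without any digit come back as NaN rather than raising, which may or may not be the behavior you want.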
Upvotes: 1
Reputation: 323226
This is easy to achieve with str.split:
df.fix_this_field.str.split(r'(\d)', n=1).str[1:].apply(''.join)
Out[475]:
0 1234, st, texas 57500
1 233 medical ln
Name: fix_this_field, dtype: object
df['col'] = df.fix_this_field.str.split(r'(\d)', n=1).str[1:].apply(''.join)
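Why this works: when the split pattern contains a capturing group, the matched delimiter is kept in the result list, so the first digit is not lost. A small sketch with plain re.split on one of the sample strings shows the intermediate pieces (the variable names are illustrative):

```python
import re

text = "dogstreet 1234, st, texas 57500"

# The parentheses in (\d) make re.split keep the matched digit in the output.
parts = re.split(r"(\d)", text, maxsplit=1)
print(parts)  # ['dogstreet ', '1', '234, st, texas 57500']

# Dropping the first piece and joining the rest restores the tail from the first digit.
tail = "".join(parts[1:])
print(tail)  # 1234, st, texas 57500
```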
Upvotes: 3