Reputation: 469
I have a data frame with 4 columns, col4 is a string including texts and digits:
Col1 Col2 Col3 Col4
Syslog 2016,09,17 1 PD380_003 %LINK-3-UPDOWN
Syslog 2016,09,17 1 NM380_005 %BGP-5-NBR_RESET
Syslog 2016,09,14 1 NM380_005 %BGP-5-NBR_RESET
Syslog 2016,09,08 1 DO NOT TICKET LO380_004 %SYS-5-CONFIG_I Config
i need to keep a substring of that column and delete anything else, so I used regex and i made a pattern but when i run the following query the result is not what i want, it replace everything with the pattern itself:
data.replace({'Col4':{'.*':'([A-Z]{2}[0-9]{3}_[0-9]{3})'}},regex=True)
desired result is:
Col1 Col2 Col3 Col4
Syslog 2016,09,17 1 PD380_003
Syslog 2016,09,17 1 NM380_005
Syslog 2016,09,14 1 LO380_004
Syslog 2016,09,08 1 LO380_004
but the result i get is like:
Col1 Col2 Col3 Col4
Syslog 2016,09,17 1 ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog 2016,09,17 1 ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog 2016,09,14 1 ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog 2016,09,08 1 ([A-Z]{2}[0-9]{3}_[0-9]{3})
what am i doing wrong?
Upvotes: 3
Views: 6233
Reputation: 104111
First, you have the wrong regex's in the wrong positions. The to_replace
argument to .replace needs to match what to replace and what to delete. So you need a ^.*
in front of and a .*$
behind your regex in this case since you want to trim the string outside the match:
^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$
Second, the replace
argument, if a regex, needs to be a capturing group or fixed string. In this case \1
will do.
Last, the Series form of .replace
has a littler simpler syntax (at least for me) to understand.
So given:
>>> df
Col1 Col2 Col3 Col4
0 SysLog 2016,09,17 1 PD380_003 %LINK-3-UPDOWN
1 SysLog 2016,09,17 1 NM380_005 %BGP-5-NBR_RESET
2 SysLog 2016,09,17 1 NM380_005 %BGP-5-NBR_RESET
3 SysLog 2016,09,17 1 DO NOT TICKET LO380_004 %SYS-5-CONFIG_I Config
You can do:
>>> df['Col4'].replace(to_replace='^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', value=r'\1', regex=True)
0 PD380_003
1 NM380_005
2 NM380_005
3 LO380_004
Name: Col4, dtype: object
You can also use a positional argument version if easier:
df['Col4'].replace('^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', r'\1', regex=True)
but you need to have regex=True
since the replacement string is to be interpreted as a regex -- not just a static string.
Finally, assign directly into the original:
>>> df['Col4']=df['Col4'].replace('^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', r'\1', regex=True)
>>> df
Col1 Col2 Col3 Col4
0 SysLog 2016,09,17 1 PD380_003
1 SysLog 2016,09,17 1 NM380_005
2 SysLog 2016,09,17 1 NM380_005
3 SysLog 2016,09,17 1 LO380_004
Upvotes: 3
Reputation: 210982
You were using RegEx in a wrong way.
{'Col4':{'.*':'([A-Z]{2}[0-9]{3}_[0-9]{3})'}}
- means replace a whatever string in the Col4
column with '([A-Z]{2}[0-9]{3}_[0-9]{3})'
Try this:
In [87]: df.replace({'Col4':{r'.*?([A-Z]{2}[0-9]{3}_[0-9]{3}).*':r'\1'}}, regex=True)
Out[87]:
Col1 Col2 Col3 Col4
0 Syslog 2016,09,17 1 PD380_003
1 Syslog 2016,09,17 1 NM380_005
2 Syslog 2016,09,14 1 NM380_005
3 Syslog 2016,09,08 1 LO380_004
Upvotes: 1
Reputation: 863791
I think you need extract
:
data.Col4 = data.Col4.str.extract('([A-Z]{2}[0-9]{3}_[0-9]{3})', expand=False)
print (data)
Col1 Col2 Col3 Col4
0 Syslog 2016,09,17 1 PD380_003
1 Syslog 2016,09,17 1 NM380_005
2 Syslog 2016,09,14 1 NM380_005
3 Syslog 2016,09,08 1 LO380_004
Upvotes: 3