Reputation: 469

replacing a column in DataFrame using regex

I have a data frame with 4 columns, col4 is a string including texts and digits:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1    PD380_003 %LINK-3-UPDOWN
Syslog        2016,09,17           1    NM380_005 %BGP-5-NBR_RESET
Syslog        2016,09,14           1    NM380_005 %BGP-5-NBR_RESET
Syslog        2016,09,08           1    DO NOT TICKET LO380_004 %SYS-5-CONFIG_I Config

i need to keep a substring of that column and delete anything else, so I used regex and i made a pattern but when i run the following query the result is not what i want, it replace everything with the pattern itself:

data.replace({'Col4':{'.*':'([A-Z]{2}[0-9]{3}_[0-9]{3})'}},regex=True)

desired result is:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1           PD380_003
Syslog        2016,09,17           1           NM380_005
Syslog        2016,09,14           1           LO380_004
Syslog        2016,09,08           1           LO380_004

but the result i get is like:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,17           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,14           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,08           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})

what am i doing wrong?

Upvotes: 3

Answers (3)

dawg

Reputation: 104111

First, you have the wrong regex's in the wrong positions. The to_replace argument to .replace needs to match what to replace and what to delete. So you need a ^.* in front of and a .*$ behind your regex in this case since you want to trim the string outside the match:

^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$

Demo

Second, the replace argument, if a regex, needs to be a capturing group or fixed string. In this case \1 will do.

Last, the Series form of .replace has a littler simpler syntax (at least for me) to understand.

So given:

>>> df
     Col1        Col2  Col3                                            Col4
0  SysLog  2016,09,17     1                        PD380_003 %LINK-3-UPDOWN
1  SysLog  2016,09,17     1                      NM380_005 %BGP-5-NBR_RESET
2  SysLog  2016,09,17     1                      NM380_005 %BGP-5-NBR_RESET
3  SysLog  2016,09,17     1  DO NOT TICKET LO380_004 %SYS-5-CONFIG_I Config

You can do:

>>> df['Col4'].replace(to_replace='^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', value=r'\1', regex=True) 
0    PD380_003
1    NM380_005
2    NM380_005
3    LO380_004
Name: Col4, dtype: object

You can also use a positional argument version if easier:

df['Col4'].replace('^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', r'\1', regex=True)

but you need to have regex=True since the replacement string is to be interpreted as a regex -- not just a static string.

Finally, assign directly into the original:

>>> df['Col4']=df['Col4'].replace('^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', r'\1', regex=True)
>>> df
     Col1        Col2  Col3       Col4
0  SysLog  2016,09,17     1  PD380_003
1  SysLog  2016,09,17     1  NM380_005
2  SysLog  2016,09,17     1  NM380_005
3  SysLog  2016,09,17     1  LO380_004

Upvotes: 3

MaxU - stand with Ukraine

Reputation: 210982

You were using RegEx in a wrong way.

{'Col4':{'.*':'([A-Z]{2}[0-9]{3}_[0-9]{3})'}} - means replace a whatever string in the Col4 column with '([A-Z]{2}[0-9]{3}_[0-9]{3})'

Try this:

In [87]: df.replace({'Col4':{r'.*?([A-Z]{2}[0-9]{3}_[0-9]{3}).*':r'\1'}}, regex=True)
Out[87]:
     Col1        Col2  Col3       Col4
0  Syslog  2016,09,17     1  PD380_003
1  Syslog  2016,09,17     1  NM380_005
2  Syslog  2016,09,14     1  NM380_005
3  Syslog  2016,09,08     1  LO380_004

Upvotes: 1

jezrael

Reputation: 863791

I think you need extract:

data.Col4 = data.Col4.str.extract('([A-Z]{2}[0-9]{3}_[0-9]{3})', expand=False)

print (data)
     Col1        Col2  Col3       Col4
0  Syslog  2016,09,17     1  PD380_003
1  Syslog  2016,09,17     1  NM380_005
2  Syslog  2016,09,14     1  NM380_005
3  Syslog  2016,09,08     1  LO380_004

Upvotes: 3

replacing a column in DataFrame using regex

Answers (3)

Related Questions