Reputation: 1185

pandas string replace function with regex gives wrong result

dfF:

    Sample  AlmostFinal  
    1          KOPLA234        
    1          KOPLA234
    2          RWPLB253
    3          MMPLA415
    3          MMPLA415

I need to replace the prefixes (KOPL, RWP, MM) with KOLPOL, while the last letter (A or B) should stay. So the result should be:

    Sample  AlmostFinal  Final
    1          KOPLA234  KOLPOLA234      
    1          KOPLA234  KOLPOLA234
    2          RWPLB253  KOLPOLB253
    3          MMPLA415  KOLPOLA415
    3          MMPLA415  KOLPOLA415

I tried to do it by replace:

    dfF['Final'] = (dfF['AlmostFinal'].replace({'KOPL':'KOLPOL'}, regex = True))
    dfF['Final'] = (dfF['AlmostFinal'].replace({'RWP':'KOLPOL'}, regex = True))
    dfF['Final'] = (dfF['AlmostFinal'].replace({'MMPL':'KOLPOL'}, regex = True))

If I comment out the 2nd and 3rd lines, the replacement for KOPL works.

When I comment out the 1st and 3rd lines, the replacement for RWP works.

But when I uncomment all three lines and run them, only the last one takes effect. Why? In another script I have similar code, and there all of the lines work.

Upvotes: 1

Answers (3)

cs95

Reputation: 403198

You can use a single replace call with regex=True:

    df['Final'] = df['AlmostFinal'].replace(
        [r'KOPL', r'RWP.*?(?=A|B)', r'MM.*(?=A|B)'], 'KOLPOL', regex=True)
    df

       Sample AlmostFinal       Final
    0       1    KOPLA234  KOLPOLA234
    1       1    KOPLA234  KOLPOLA234
    2       2    RWPLB253  KOLPOLB253
    3       3    MMPLA415  KOLPOLA415
    4       3    MMPLA415  KOLPOLA415

We want to handle a varying number of characters between each substring and the last letter, so a regex with a lookahead is useful here.
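To see what the lookahead does in isolation, here is a minimal sketch using Python's standard `re` module instead of pandas. Folding the three prefixes into one alternation and the variable names are assumptions made for illustration; they are not part of the answer's code.

```python
import re

# Sample values copied from the question.
values = ["KOPLA234", "RWPLB253", "MMPLA415"]

# Match a known prefix plus any characters up to (but not including) the
# first A or B. The lookahead (?=A|B) only checks that the letter is there;
# it does not consume it, so the letter survives the substitution.
pattern = re.compile(r"(?:KOPL|RWP|MM).*?(?=A|B)")

results = [pattern.sub("KOLPOL", value, count=1) for value in values]
print(results)
```

Because the A or B is never part of the match, only the prefix portion is swapped out, however many characters sit between the prefix and that letter.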

Further generalisation is possible. Just define your substrings, then append a lookahead to each one with a list comprehension.

    pat = ['KOPL', 'RWP', 'MM']
    df['Final'] = df['AlmostFinal'].replace(
        [rf'{p}.*(?=A|B)' for p in pat], 'KOLPOL', regex=True)  # need python3.6+
    df

       Sample AlmostFinal       Final
    0       1    KOPLA234  KOLPOLA234
    1       1    KOPLA234  KOLPOLA234
    2       2    RWPLB253  KOLPOLB253
    3       3    MMPLA415  KOLPOLA415
    4       3    MMPLA415  KOLPOLA415

If you want to replace specific substrings, the solution is a little simpler.

    pat = ['KOPL', 'RWPL', 'MMPL']
    df['AlmostFinal'].replace(pat, 'KOLPOL', regex=True)

    0    KOLPOLA234
    1    KOLPOLA234
    2    KOLPOLB253
    3    KOLPOLA415
    4    KOLPOLA415
    Name: AlmostFinal, dtype: object

No other modifications required. For more general replacements, see above.

Upvotes: 1

DYZ

Reputation: 57115

You should execute one assignment, not three. Otherwise, each subsequent assignment overwrites the result of the previous one.

    dfF['Final'] = dfF['AlmostFinal']\
                   .replace({'KOP|RWP|MMP': 'KOLPO'}, regex = True)
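The overwrite is easy to reproduce. The following is a small sketch, assuming a three-row frame built from the question's sample values; `overwritten` and `combined` are names introduced here only to capture the two outcomes.

```python
import pandas as pd

# One row per distinct value from the question.
dfF = pd.DataFrame({"AlmostFinal": ["KOPLA234", "RWPLB253", "MMPLA415"]})

# Each assignment starts again from the untouched 'AlmostFinal' column,
# so only the last replacement survives in 'Final'.
dfF["Final"] = dfF["AlmostFinal"].replace({"KOPL": "KOLPOL"}, regex=True)
dfF["Final"] = dfF["AlmostFinal"].replace({"MMPL": "KOLPOL"}, regex=True)
overwritten = dfF["Final"].tolist()

# A single assignment with one alternation pattern handles every prefix.
dfF["Final"] = dfF["AlmostFinal"].replace({"KOP|RWP|MMP": "KOLPO"}, regex=True)
combined = dfF["Final"].tolist()

print(overwritten)
print(combined)
```

In the first list only the MMPL row is changed, which matches the behaviour described in the question; the second list has all three rows converted.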

Upvotes: 1

Masklinn

Reputation: 42602

> If I comment out the 2nd and 3rd lines, the replacement for KOPL works. When I comment out the 1st and 3rd lines, the replacement for RWP works. But when I uncomment all three lines and run them, only the last one takes effect. Why?

Because replace returns a new object instead of modifying the original, and since you always run the replacement on the original AlmostFinal column, each assignment throws away the result of the previous one.

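Either do all replacements simultaneously, e.g. use a single regex or, I guess, a single dict with multiple entries (not sure why you'd use a dict for a single value here, really). Note that each key has to cover the whole prefix, including the L, or the result ends up with a doubled L:

    {
        'KOPL':'KOLPOL',
        'RWPL':'KOLPOL',
        'MMPL':'KOLPOL',
    }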

or perform each replace on the result of the previous one (either chain the replace calls, or have the second and third operate on dfF['Final'] instead of dfF['AlmostFinal']).
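As a rough sketch of the chained variant, assuming a small frame built from the question's sample values and using the full four-letter prefixes so that no extra L is left behind:

```python
import pandas as pd

# One row per distinct value from the question.
dfF = pd.DataFrame({"AlmostFinal": ["KOPLA234", "RWPLB253", "MMPLA415"]})

# Each replace returns a new Series, so chaining feeds the result of one
# call into the next instead of restarting from 'AlmostFinal' every time.
dfF["Final"] = (
    dfF["AlmostFinal"]
    .replace({"KOPL": "KOLPOL"}, regex=True)
    .replace({"RWPL": "KOLPOL"}, regex=True)
    .replace({"MMPL": "KOLPOL"}, regex=True)
)

print(dfF["Final"].tolist())
```

The single-dict form does the same work in one call; chaining is mainly useful when the replacements have to happen in a particular order.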

Upvotes: 1
