G_T
G_T

Reputation: 1587

Conditional editing of strings in a Pandas DataFrame

I am learning Pandas and have a DataFrame of strings which looks a little like this:

df = pd.DataFrame([['Apple', 'Med6g7867'], ['Orange', 'Med7g8976'], ['Banana', 'Signal'], ['Peach', 'Med8g8989'], ['Mango', 'Possible result %gggyy']], columns=['A', 'B'])
df
    A       B
0   Apple   Med6g7867
1   Orange  Med7g8976
2   Banana  Signal
3   Peach   Med8g8989
4   Mango   Possible result %gggyy

Note column B has two types of value, either a unique identifier of the form MedXgXXXX or a descriptive string. I would like to do two related things.

  1. Substitute all the values of B with the unique identifier to NaN
  2. Retain the descriptive string but truncate any that have a % sign so that I only retain the string prior to the % sign.

I would like a table like this:

    A       B
0   Apple   NaN
1   Orange  NaN
2   Banana  Signal
3   Peach   NaN
4   Mango   Possible result

Currently I can subset the table like so:

df[df['B'].str.contains("Med")]
df[df['B'].str.contains("%")]

but no implementation of replace() I try allows me to do this.

Any help appreciated.

Upvotes: 2

Views: 1051

Answers (2)

unutbu
unutbu

Reputation: 879919

import pandas as pd
df = pd.DataFrame([['Apple', 'Med6g7867'],
                   ['Orange', 'Med7g8976'],
                   ['Banana', 'Signal'],
                   ['Peach', 'Med8g8989'],
                   ['Mango', 'Possible result %gggyy']],
                  columns=['A', 'B'])

df['B'] = df['B'].str.extract(r'(?:^Med.g.{4})|([^%]+)', expand=False)
print(df)

yields

        A                 B
0   Apple               NaN
1  Orange               NaN
2  Banana            Signal
3   Peach               NaN
4   Mango  Possible result 

The regex pattern has the following meaning:

(?:            # start a non-capturing group
  ^            # match the start of the string
  Med          # match the literal string Med
  .            # followed by any character
  g            # a literal g
  .{4}         # followed by any 4 characters
)              # end the non-capturing group
|              # OR
(              # start a capturing group
  [^%]+        # 1-or-more of any characters except %
)              # end capturing group

If the value in the B column starts with a unique indentifier of the form MedXgXXXX then the non-capturing group will be matched. Since str.extract only returns the value from capturing groups, the Series returned by str.extract will have a NaN at this location.

If instead the capturing group is matched, then str.extract will return the matched value.

Upvotes: 2

Nehal J Wani
Nehal J Wani

Reputation: 16619

You can apply replace twice like this:

In [460]: df
Out[460]: 
        A                       B
0   Apple               Med6g7867
1  Orange               Med7g8976
2  Banana                  Signal
3   Peach               Med8g8989
4   Mango  Possible result %gggyy

In [461]: df.replace(r'Med\dg\d{4}', np.nan, regex=True).replace(r'\s+%.*', '', regex=True)
Out[461]: 
        A                B
0   Apple              NaN
1  Orange              NaN
2  Banana           Signal
3   Peach              NaN
4   Mango  Possible result

Upvotes: 2

Related Questions