Reputation: 1587
I am learning Pandas and have a DataFrame of strings which looks a little like this:
df = pd.DataFrame([['Apple', 'Med6g7867'], ['Orange', 'Med7g8976'], ['Banana', 'Signal'], ['Peach', 'Med8g8989'], ['Mango', 'Possible result %gggyy']], columns=['A', 'B'])
df
A B
0 Apple Med6g7867
1 Orange Med7g8976
2 Banana Signal
3 Peach Med8g8989
4 Mango Possible result %gggyy
Note column B has two types of value, either a unique identifier of the form MedXgXXXX or a descriptive string. I would like to do two related things.
I would like a table like this:
A B
0 Apple NaN
1 Orange NaN
2 Banana Signal
3 Peach NaN
4 Mango Possible result
Currently I can subset the table like so:
df[df['B'].str.contains("Med")]
df[df['B'].str.contains("%")]
but no implementation of replace()
I try allows me to do this.
Any help appreciated.
Upvotes: 2
Views: 1051
Reputation: 879919
import pandas as pd
df = pd.DataFrame([['Apple', 'Med6g7867'],
['Orange', 'Med7g8976'],
['Banana', 'Signal'],
['Peach', 'Med8g8989'],
['Mango', 'Possible result %gggyy']],
columns=['A', 'B'])
df['B'] = df['B'].str.extract(r'(?:^Med.g.{4})|([^%]+)', expand=False)
print(df)
yields
A B
0 Apple NaN
1 Orange NaN
2 Banana Signal
3 Peach NaN
4 Mango Possible result
The regex pattern has the following meaning:
(?: # start a non-capturing group
^ # match the start of the string
Med # match the literal string Med
. # followed by any character
g # a literal g
.{4} # followed by any 4 characters
) # end the non-capturing group
| # OR
( # start a capturing group
[^%]+ # 1-or-more of any characters except %
) # end capturing group
If the value in the B
column starts with a unique indentifier of the form
MedXgXXXX
then the non-capturing group will be matched. Since str.extract
only returns the value from capturing groups, the Series
returned by
str.extract
will have a NaN
at this location.
If instead the capturing group is matched, then str.extract
will return the
matched value.
Upvotes: 2
Reputation: 16619
You can apply replace twice like this:
In [460]: df
Out[460]:
A B
0 Apple Med6g7867
1 Orange Med7g8976
2 Banana Signal
3 Peach Med8g8989
4 Mango Possible result %gggyy
In [461]: df.replace(r'Med\dg\d{4}', np.nan, regex=True).replace(r'\s+%.*', '', regex=True)
Out[461]:
A B
0 Apple NaN
1 Orange NaN
2 Banana Signal
3 Peach NaN
4 Mango Possible result
Upvotes: 2