Reputation: 9141
I want to essentially find and replace using python.
However, I want to say if a cell contains something, then replace with what I want.
I know
str.replace('safsd','something else')
However, I am not sure how to specify how to get rid of EVERYTHING in that cell. Do I use *
? I am not too familiar with that in python but I know in the bash shell *
references to everything...
I have
df['Description']
that can contain 'optiplex 9010 for classes and research'
which I just want to replace with 'optiplex 9010'
. Or 'macbook air 11 with configurations...etc.'
and I want simply 'macbook air 11'
I am aiming for...
if Df['Description'].str.contains('macbook air 11')
then Df['Description'].str.replace(' (not sure what I put in here) , 'mabook air 11')
Any help/ideas?
Thanks!
**Additional info that may be helfpul...
I am working with thousands of different user inputs. So the 'Descriptions' of what someone has purchased is not going to be the same at all in context, wording, structure, etc. etc. I can either manually go into excel and filter by what contains 'optiplex 9010' and then replace everything with a simple description , doing the same for macbooks, etc.
I figured there may be some simpler way using pandas/python .str.contains and .str.replace.
Hope that extra info helps! Let me know
Upvotes: 4
Views: 8053
Reputation: 103884
You can use a regex on a Pandas series like so.
First create a dumb series of strings:
>>> import re
>>> import pandas as pd
>>> s=pd.Series(['Value {} of 3'.format(e) for e in range(1,4)])
>>> s
0 Value 1 of 3
1 Value 2 of 3
2 Value 3 of 3
Then use a regex sub to replace the string value of all digits with 5
and lower case the string:
>>> s.apply(lambda s: re.sub(r'\d+', '5', s).lower())
0 value 5 of 5
1 value 5 of 5
2 value 5 of 5
dtype: object
Of course if you want to just replace all, you can use a regex or string replace:
>>> s.apply(lambda s: re.sub(r'^.*$', 'GONE!!!', s))
0 GONE!!!
1 GONE!!!
2 GONE!!!
dtype: object
>>> s.apply(lambda s: s.replace(s, 'GONE!!!'))
0 GONE!!!
1 GONE!!!
2 GONE!!!
dtype: object
Upvotes: 0
Reputation: 938
This is a perfect example of a problem that can be solved using regexes. And I also find that a situation like this is a great excuse to learn about them! Here is an incredibly detailed tutorial on how to use regexes http://www.regular-expressions.info/tutorial.html
Upvotes: -1
Reputation: 375565
str.replace takes a regular expression, for example 'macbook air 11'
followed zero (or more) (*
) of any characters (.
) (you could also flag to be case insensitive):
Df['Description'].str.replace('macbook air 11.*' , 'macbook air 11')
A little primer on regex can be found here.
However, you might be better off, especially if you have already have a complete list of topics, to normalize the names (e.g. using fuzzywuzzy like in this question / answer):
from fuzzywuzzy.fuzz import partial_ratio
Df['Description'].apply(lambda x: max(topics, key=lambda t: partial_ratio(x, t)))
Upvotes: 5