Alexis
Alexis

Reputation: 9141

How to replace all words in a series with a few specified words in Pandas,Python?

I want to essentially find and replace using python.

However, I want to say if a cell contains something, then replace with what I want.

I know

str.replace('safsd','something else')

However, I am not sure how to specify how to get rid of EVERYTHING in that cell. Do I use *? I am not too familiar with that in python but I know in the bash shell * references to everything...

I have

df['Description'] 

that can contain 'optiplex 9010 for classes and research' which I just want to replace with 'optiplex 9010'. Or 'macbook air 11 with configurations...etc.' and I want simply 'macbook air 11'

I am aiming for...

if  Df['Description'].str.contains('macbook air 11')
  then Df['Description'].str.replace(' (not sure what I put in here) , 'mabook air 11')

Any help/ideas?

Thanks!

**Additional info that may be helfpul...

I am working with thousands of different user inputs. So the 'Descriptions' of what someone has purchased is not going to be the same at all in context, wording, structure, etc. etc. I can either manually go into excel and filter by what contains 'optiplex 9010' and then replace everything with a simple description , doing the same for macbooks, etc.

I figured there may be some simpler way using pandas/python .str.contains and .str.replace.

Hope that extra info helps! Let me know

Upvotes: 4

Views: 8053

Answers (3)

dawg
dawg

Reputation: 103884

You can use a regex on a Pandas series like so.

First create a dumb series of strings:

>>> import re
>>> import pandas as pd
>>> s=pd.Series(['Value {} of 3'.format(e) for e in range(1,4)])
>>> s
0     Value 1 of 3
1     Value 2 of 3
2     Value 3 of 3

Then use a regex sub to replace the string value of all digits with 5 and lower case the string:

>>> s.apply(lambda s: re.sub(r'\d+', '5', s).lower())
0    value 5 of 5
1    value 5 of 5
2    value 5 of 5
dtype: object

Of course if you want to just replace all, you can use a regex or string replace:

>>> s.apply(lambda s: re.sub(r'^.*$', 'GONE!!!', s))
0    GONE!!!
1    GONE!!!
2    GONE!!!
dtype: object
>>> s.apply(lambda s: s.replace(s, 'GONE!!!'))
0    GONE!!!
1    GONE!!!
2    GONE!!!
dtype: object

Upvotes: 0

nagyben
nagyben

Reputation: 938

This is a perfect example of a problem that can be solved using regexes. And I also find that a situation like this is a great excuse to learn about them! Here is an incredibly detailed tutorial on how to use regexes http://www.regular-expressions.info/tutorial.html

Upvotes: -1

Andy Hayden
Andy Hayden

Reputation: 375565

str.replace takes a regular expression, for example 'macbook air 11' followed zero (or more) (*) of any characters (.) (you could also flag to be case insensitive):

Df['Description'].str.replace('macbook air 11.*' , 'macbook air 11')

A little primer on regex can be found here.

However, you might be better off, especially if you have already have a complete list of topics, to normalize the names (e.g. using fuzzywuzzy like in this question / answer):

from fuzzywuzzy.fuzz import partial_ratio
Df['Description'].apply(lambda x: max(topics, key=lambda t: partial_ratio(x, t)))

Upvotes: 5

Related Questions