EGM8686
EGM8686

Reputation: 1572

Python Compute new variable based on substring

I have a dataframe with one column containing text, and I would like to create a new variable if a substring exists on a column

original df

team
1New York MetsMets1
2Seattle MarinersMariners2
3Los Angeles AngelsAngels3

new df

team                       code  
1New York MetsMets1        NYM 
2Seattle MarinersMariners2 SEA
3Los Angeles AngelsAngels3 ANA

So basically if the column contains the text 'Mets' then code column will be NYM, if it contains 'Mariners' then code equals SEA, etc. The reason for this is because the Team column can contain extraneous characters or extra information that I want to ignore, so I need some type of rule such that:

if exists in team(Mariners) then code = 'SEA'
if exists in team(New York Mets) then code = 'Mets'

Thanks!

Upvotes: 0

Views: 144

Answers (2)

jezrael
jezrael

Reputation: 862601

Create dictonary for lookup values, get substrings by Series.str.extract and then mapping by Series.map:

d= {'Mariners':'SEA','New York Mets':'Mets'}
df['code'] = df['team'].str.extract(f'({"|".join(d)})', expand=False).map(d)

print (df)
                         team  code
0         1New York MetsMets1  Mets
1  2Seattle MarinersMariners2   SEA
2  3Los Angeles AngelsAngels3   NaN

Upvotes: 1

DataSciRookie
DataSciRookie

Reputation: 1200

You should try something like this :

df.loc[df['team'].str.contains('Mariners'),'code']='SEA'

but with a loop depending of the number of different words. You should do some preprocessing, maybe put everything in lower case to be sure you are not missing mariners instead of Mariners

Upvotes: 1

Related Questions