ELS
ELS

Reputation: 11

Assign Values in Column from Substring

I am new to Pandas and have a hard time figuring out the best way to solve the below problem:

I have a Dataframe with one column called Email like below:

Email
[email protected]
[email protected]
NAN
[email protected]
[email protected]

I separated the strings on '@' to create a Domain column and want to use that column to assign keywords in a new column. For instance, if the domain contains the word 'yahoo', call it 'Yahoo Account' in the new column. If it does not contain the word 'yahoo', assign it the value 'Other Domain', and if it is NaN, call it 'Unknown'.

The new column called Affiliation would look like:

Affiliation 
Yahoo Account 
Other Domain 
Unknown 
Yahoo Account 
Other Domain

There are over 2,000 different types of domains so am looking for a way where I don't list and map all the unique domains as either being "Yahoo Account" or "Other Domain."

I have looked into a few options, one of which is the where clause, but it assigns NaN values to the Other Domain keyword.

df['Affiliation'] = np.where(df['Domain']=='yahoo', 'Yahoo Account', 'Other Domain')

I have also started to look at using the replace clause, but don't think this is the best way due to the amount of unique domains there are that would need to be added to other_affiliations. See below:

yahoo_affiliations = (r'(yahoo\S*)')
other_affiliations= (r'(gmail\S*)|(hotmail\S*)|(outlook\S*)')

# Create a new column called Affiliation from Domains
df['Affiliation'] = df['Domain']

# Fill NaN with Unknwon
df['Affiliation']  = df['Affiliation'].fillna('Unknown')

replacements = {
           'Affiliation': {yahoo_affiliations: 'Yahoo Account',
                                        other_affiliations: 'Other Domain'}
                        }

df.replace(replacements, regex=True, inplace=True)

Upvotes: 0

Views: 46

Answers (1)

oppressionslayer
oppressionslayer

Reputation: 7204

You can split them like this to get your mappings

email_map = {'yahoo.com': 'Yahoo Account',
'gmail.com': 'Other Domain',
'gmail.com.it': 'Other Domain' 
}
dfa['domain'] = dfa['Email'].str.extract(r'.*?@(.*)') 
dfa['Affiliation'] = dfa['domain'].map(email_map).fillna('Unknown') 

output:

              Email        domain    Affiliation
0     [email protected]     yahoo.com  Yahoo Account
1     [email protected]     gmail.com   Other Domain
2               NAN           NaN        Unknown
3     [email protected]     yahoo.com  Yahoo Account
4  [email protected]  gmail.com.it   Other Domain

Upvotes: 1

Related Questions