Reputation: 2443
I don't know how to extract the domain part from an email address with pandas. For example, if the address is '[email protected]', I would like to get 'gmail.com'.
Please give me an idea.
Upvotes: 5
Views: 11767
Reputation: 23
Though both answers are useful, I checked which one is fastest. Of the answers provided by Jezrael and Akhilesh, only Jezrael's first method is robust against NaN values. However, Akhilesh's answer is the fastest by quite a margin.
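To make the NaN point concrete, here is a minimal sketch. The sample addresses are made up for illustration and are not the data used in the timings:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
df = pd.DataFrame({'email': ['alice@gmail.com', np.nan, 'bob@yahoo.com']})

# The .str accessor passes missing values through instead of raising
domains = df['email'].str.split('@').str[1]

# Calling str.split on each element fails, because NaN is a float
try:
    df['email'].apply(lambda x: x.split('@')[1])
    raised = False
except AttributeError:
    raised = True

print(domains.tolist())
print('lambda version raised AttributeError:', raised)
```

So the speed difference only matters if the column is known to contain no missing values.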
Timing was done as follows:
import timeit
import pandas as pd

df = pd.DataFrame({'email':['[email protected]','[email protected]']})

def method1():
    # vectorised .str accessor (robust against NaN)
    df['domain'] = df['email'].str.split('@').str[1]
    return df

def method2():
    # apply with a lambda
    df['domain'] = df['email'].apply(lambda x: x.split('@')[1])
    return df

def method3():
    # list comprehension
    df['domain'] = [x.split('@')[1] for x in df['email']]
    return df

print('Time for method 1:', timeit.timeit(method1, number=100000))
print('Time for method 2:', timeit.timeit(method2, number=100000))
print('Time for method 3:', timeit.timeit(method3, number=100000))
Results with nan values:
Results without nan values:
Upvotes: 0
Reputation: 11
This can also be done using a lambda function.
df = pd.DataFrame({'email':['[email protected]','[email protected]', '[email protected]']})
df['domain'] = df['email'].apply(lambda x: x.split('@')[1])
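If the column may contain missing values or strings without an '@', the lambda can be swapped for a small guarded function. This is only a sketch: `domain_or_nan` is a hypothetical helper name and the sample addresses are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing value, one string without '@'
df = pd.DataFrame({'email': ['alice@gmail.com', np.nan, 'not-an-address']})

def domain_or_nan(value):
    """Return the part after the first '@', or NaN if there is none."""
    if isinstance(value, str) and '@' in value:
        return value.split('@', 1)[1]
    return np.nan

df['domain'] = df['email'].apply(domain_or_nan)
print(df)
```

The extra checks cost some speed compared with the bare lambda, so this is a trade-off rather than a strict improvement.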
Upvotes: 1
Reputation: 863166
I believe you need str.split, then select the second value of each resulting list by indexing with str[1]:
df = pd.DataFrame({'email':['[email protected]','[email protected]']})
df['domain'] = df['email'].str.split('@').str[1]
# faster solution if there are no NaN values
#df['domain'] = [x.split('@')[1] for x in df['email']]
print(df)
               email     domain
0  [email protected]  gmail.com
1  [email protected]  yahoo.com
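As a different technique from the split shown above, a regular expression with str.extract may also work. This is a sketch under assumptions: the sample addresses are made up, and the pattern simply captures everything after the last '@'.

```python
import pandas as pd

# Hypothetical sample data, including a string without '@' and a missing value
df = pd.DataFrame({'email': ['alice@gmail.com', 'bob@yahoo.com', 'no-at-sign', None]})

# With a single capture group and expand=False, str.extract returns a Series;
# rows that do not match the pattern (or are missing) come back as NaN
df['domain'] = df['email'].str.extract(r'@([^@]+)$', expand=False)
print(df)
```

Unlike indexing with str[1], this takes the text after the last '@' rather than the text between the first and second, which only differs for strings containing more than one '@'.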
Upvotes: 18