Reputation: 1387
I have a pandas column containing email domains, something like this:
Sno Domain_IDs
1 herowire.com
2 xyzenerergy.com
3 financial.com
4 oo-loans.com
5 okwire.com
6 cleaneneregy.com
7 pop-advisors.com
and so on....
I have the following categories in a separate dataframe:
Sno category
1 contains wire
2 contains energy
3 contains loans
4 contains advisors
I want to create a dataframe which categorizes the data as follows:
Sno Domain_IDS category
1 herowire.com contains wire
2 xyzenerergy.com contains energy
3 financial.com others
4 oo-loans.com contains loans
5 okwire.com contains wire
6 cleaneneregy.com contains energy
7 pop-advisors.com contains advisors
I tried using lambda functions and a standard loop with if/else statements, using the
"emailAddress.str.contains('wire')"
clause, but I am getting the following error:
AttributeError: 'str' object has no attribute 'str'
Somehow I am not able to parse the single line of text in the dataframe. Kindly help.
Upvotes: 2
Views: 4123
Reputation: 730
This solution allows a domain to match more than one category:
import pandas as pd

categories = pd.DataFrame({"category": ["wire", "energy", "loans", "advisors"]})
domains = pd.DataFrame({"Sno": list(range(1, 10)),
                        "Domain_IDs": [
                            "herowire.com",
                            "xyzenergy.com",
                            "financial.com",
                            "oo-loans.com",
                            "okwire.com",
                            "cleanenergy.com",
                            "pop-advisors.com",
                            "energy-advisors.com",
                            "wire-loans.com"]})
# A constant key makes the merge pair every category with every domain (a cross join).
categories["common"] = 0
domains["common"] = 0
possibilities = pd.merge(categories, domains, how="outer")
# Flag each pair where the category keyword appears in the domain name.
possibilities["satisfied"] = possibilities.apply(lambda row: row["category"] in row["Domain_IDs"], axis=1)
Filtering only the rows where the category is satisfied:
possibilities[possibilities["satisfied"]]
gives:
category common Domain_IDs Sno satisfied
0 wire 0 herowire.com 1 True
4 wire 0 okwire.com 5 True
8 wire 0 wire-loans.com 9 True
10 energy 0 xyzenergy.com 2 True
14 energy 0 cleanenergy.com 6 True
16 energy 0 energy-advisors.com 8 True
21 loans 0 oo-loans.com 4 True
26 loans 0 wire-loans.com 9 True
33 advisors 0 pop-advisors.com 7 True
34 advisors 0 energy-advisors.com 8 True
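If one row per domain is wanted, as in the question, the satisfied pairs can be collapsed afterwards. The following is only a sketch on a smaller, made-up sample; the names `matched` and `summary` and the comma-separated output format are my own assumptions, not something the question asked for.

```python
import pandas as pd

categories = pd.DataFrame({"category": ["wire", "energy", "loans", "advisors"]})
# Smaller illustrative sample (assumption), not the full data above.
domains = pd.DataFrame({
    "Sno": [1, 2, 3],
    "Domain_IDs": ["herowire.com", "financial.com", "wire-loans.com"],
})

# Same constant-key trick as above: pair every category with every domain.
possibilities = pd.merge(
    categories.assign(common=0), domains.assign(common=0), on="common"
)
possibilities["satisfied"] = possibilities.apply(
    lambda row: row["category"] in row["Domain_IDs"], axis=1
)

# Join all matching keywords of a domain into one comma-separated string.
matched = (
    possibilities[possibilities["satisfied"]]
    .groupby("Sno")["category"]
    .agg(", ".join)
)

# Map the result back onto the domains; domains without any match get "others".
summary = domains.copy()
summary["category"] = summary["Sno"].map(matched).fillna("others")
print(summary)
```

Domains that match several keywords keep all of them in one cell, and domains that match nothing fall back to "others".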
Upvotes: 1
Reputation: 38415
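Build a regex pattern from the category keywords, extract the first match from each domain, and create the category column:
pat = '('+'|'.join(cat['Sno category'].str.split().str[-1])+')'
# expand=False makes str.extract return a Series rather than a one-column DataFrame
df['category'] = ('contains ' + df['Domain_IDs'].str.extract(pat, expand=False)).fillna('other')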
Sno Domain_IDs category
0 1 herowire.com contains wire
1 2 xyzenenergy.com contains energy
2 3 financial.com other
3 4 oo-loans.com contains loans
4 5 okwire.com contains wire
5 6 cleaneneregy.com other
6 7 pop-advisors.com contains advisors
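A self-contained sketch of the same idea, in case it helps to run it in isolation. The two frames below are made-up stand-ins for the question's data, the keyword column is assumed to be named `category`, and `re.escape` is my addition so that keywords containing regex metacharacters are matched literally.

```python
import re
import pandas as pd

# Hypothetical frames mirroring the layout in the question.
cat = pd.DataFrame({"category": ["contains wire", "contains energy",
                                 "contains loans", "contains advisors"]})
df = pd.DataFrame({"Domain_IDs": ["herowire.com", "financial.com",
                                  "oo-loans.com", "wire-loans.com"]})

# The last word of each label is the keyword; escape it for safe use in a regex.
keywords = cat["category"].str.split().str[-1]
pat = "(" + "|".join(re.escape(k) for k in keywords) + ")"

# expand=False returns a Series; rows without a match are NaN.
found = df["Domain_IDs"].str.extract(pat, expand=False)
df["category"] = ("contains " + found).fillna("other")
print(df)
```

Note that `str.extract` reports only the first match in each string, so a domain such as wire-loans.com ends up in a single category.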
Upvotes: 4
Reputation: 1475
lst = ["wire", "energy", "loans", "advisors"]

def fun(a):
    # Return the first keyword found in the domain, or "others" if none match.
    for i in lst:
        if i in a:
            return i
    return "others"

df["category"] = df.Domain_IDs.apply(fun)
df
Sno Domain_IDs category
0 1 herowire.com wire
1 2 xyzenenergy.com energy
2 3 financial.com others
3 4 oo-loans.com loans
4 5 okwire.com wire
5 6 cleanenergy.com energy
6 7 pop-advisors.com advisors
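As for the AttributeError in the question: it most likely comes from calling `.str` inside the lambda, where each value is already a plain Python string. The `.str` accessor exists on the Series, not on its elements. Below is a rough vectorized alternative that uses the accessor on the whole column; the sample frame is made up, and `np.select` is my choice here rather than anything from the question.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"Domain_IDs": ["herowire.com", "financial.com", "oo-loans.com"]})
lst = ["wire", "energy", "loans", "advisors"]

# .str belongs to the Series, so each call checks the whole column at once.
conditions = [df["Domain_IDs"].str.contains(k, regex=False) for k in lst]

# np.select takes the first keyword whose condition is True, else the default.
df["category"] = np.select(conditions, lst, default="others")
print(df)
```

Like the loop above, this assigns the first matching keyword in the order given by `lst`.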
Upvotes: 1