Reputation: 539
I want to categorize data by city based on the string in a dataframe column. First, I tried writing an if-else statement, but the code became very long. So I plan to drive the if-else from an array instead: the function should check whether the value in the dataframe matches any entry in the array, and then categorize the data accordingly.
Sample data:
**full**
london
menchester united
i live in mench
lndon
lndn is huge
chester
scotland
scot
menches
manchaster
My code is
import pandas as pd
data = pd.read_excel(r'/c:/Documents/data.xlsx')
def func(a):
    london = ['london','lo','ldn','lnn','lndon','lon','ld','ndn']
    manchester = ['hester','ester','mencstr']
    if str(london) in a.lower():
        return "london"
    elif str(manchester) in a.lower():
        return "manchester"
    else:
        return "others"
data["city"] = data["full"].apply(lambda x: func(x))
Initial if-else statement code:
if "london" in a.lower():
    return "london"
elif "lond" in a.lower():
    return "london"
elif "lndn" in a.lower():
    return "london"
elif "menchester" in a.lower():
    return "manchester"
elif "hester" in a.lower():
    return "manchester"
elif "mnhester" in a.lower():
    return "manchester"
else:
    return "others"
This code is definitely wrong, but I'm not sure how to change it so that I don't have to write a long if-else statement and can instead compare the data against an array or dictionary. Note: the data in the code is just an example; the real data is big.
Upvotes: 1
Views: 193
Reputation: 10970
Create a dictionary for your data and use pandas.Series.str.contains to check whether any of the alternatives occurs in each value. Use numpy.where to conditionally replace. Note that the contains method treats its argument as a regex when searching the column.
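import numpy as np
# df is the DataFrame from the question (named data there)
data = {
    'london': ['london','lo','ldn','lnn','lndon','lon','ld','ndn'],
    'manchester': ['hester','ester','mencstr']
}
for city, alts in data.items():
    # alts are joined into one regex alternation, e.g. 'hester|ester|mencstr'
    df['full'] = np.where(df.full.str.contains('|'.join(alts)), city, df['full'])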
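Or, as a one-liner alternative, use replace with regex=True. Wrapping each alternation in .*(?:...).* makes a match replace the whole string rather than only the matching substring:
df.full.replace({'.*(?:' + '|'.join(d) + ').*': c for c, d in data.items()}, regex=True)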
Output
full
0 london
1 manchester
2 mench
3 london
4 london
5 manchester
6 scotland
7 scot
8 menches
9 manchaster
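To keep the original column and write the result to a separate city column with an "others" fallback, as the question asks, the same str.contains idea can be combined with numpy.select. This is only a minimal sketch: it assumes the question's sample values are built inline instead of being read from Excel, the names city_keys, lowered and conditions are made up for this example, and re.escape is added on the assumption that the keys are literal substrings rather than regex patterns.

```python
import re

import numpy as np
import pandas as pd

# Sample values copied from the question (assumption: no Excel file is read here).
df = pd.DataFrame({"full": [
    "london", "menchester united", "i live in mench", "lndon",
    "lndn is huge", "chester", "scotland", "scot", "menches", "manchaster",
]})

# City label -> substrings that identify it (lists taken from the question).
city_keys = {
    "london": ["london", "lo", "ldn", "lnn", "lndon", "lon", "ld", "ndn"],
    "manchester": ["hester", "ester", "mencstr"],
}

lowered = df["full"].str.lower()

# One boolean mask per city; re.escape treats every key as literal text.
conditions = [
    lowered.str.contains("|".join(map(re.escape, keys)), na=False).to_numpy(dtype=bool)
    for keys in city_keys.values()
]

# The first matching city wins; rows that match nothing get "others".
df["city"] = np.select(conditions, list(city_keys.keys()), default="others")
print(df)
```

Because numpy.select takes the first condition that is true, the order of the dictionary decides which city wins when a value matches keys from more than one list. Very short keys such as 'lo' or 'ld' will also match unrelated words, so they may need tightening on real data.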
Upvotes: 1
Reputation: 91
import pandas as pd
data = pd.read_excel(r'/c:/Documents/data.xlsx')
# creating dict like {"lnn":"london","ld":"london"}
london_dict = {k: "london" for k in ['london','lo','ldn','lnn','lndon','lon','ld','ndn']}
manchester_dict = {k: "manchester" for k in ['hester','ester','mencstr']}
# Merging all the dicts
city_dict = {**london_dict, **manchester_dict}
def func(a):
    text = str(a).lower()
    # substring match, so "lndn is huge" is caught by the key "ndn"
    for key, city in city_dict.items():
        if key in text:
            return city
    return "others"
data["city"] = data["full"].apply(func)
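For a quick check without the Excel file, a dictionary-driven lookup can be tried on the question's sample strings directly. This is a rough sketch that matches keys as substrings (a value such as "lndn is huge" is not equal to any key, so an equality test would not catch it); CITY_KEYS, categorize and samples are hypothetical names used only here.

```python
# City label -> substrings that identify it (lists taken from the question).
CITY_KEYS = {
    "london": ["london", "lo", "ldn", "lnn", "lndon", "lon", "ld", "ndn"],
    "manchester": ["hester", "ester", "mencstr"],
}


def categorize(text, city_keys=CITY_KEYS, default="others"):
    """Return the first city whose key occurs as a substring of text."""
    lowered = str(text).lower()
    for city, keys in city_keys.items():
        if any(key in lowered for key in keys):
            return city
    return default


# Sample values copied from the question.
samples = [
    "london", "menchester united", "i live in mench", "lndon",
    "lndn is huge", "chester", "scotland", "scot", "menches", "manchaster",
]
labels = [categorize(s) for s in samples]

for sample, label in zip(samples, labels):
    print(f"{sample!r} -> {label}")
```

With a DataFrame, the same function could be applied as data["city"] = data["full"].apply(categorize). Adding a new city or spelling variant then only means editing the dictionary, not the function.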
Upvotes: 1