swm

Reputation: 539

If else statement based on array/dictionary

I want to categorize data by city based on a string in a DataFrame column. At first I tried writing an if-else statement, but the code became too long. So I plan to base the if-else statement on an array: check whether the value in the DataFrame matches any entry in the array, then categorize the data accordingly.

Sample data:

**full**
london
menchester united 
i live in mench
lndon
lndn is huge
chester
scotland
scot
menches
manchaster

My code is

    import pandas as pd

    data = pd.read_excel (r'/c:/Documents/data.xlsx')
    
    def func(a):
        london = ['london','lo','ldn','lnn','lndon','lon','ld','ndn']
        manchester = ['hester','ester','mencstr']
    
        if str(london) in a.lower():
            return "london"
        elif str(manchester) in a.lower():
            return "manchester"
        else:
            return "others"
    
    data["city"] = data["full"].apply(lambda x: func(x))

Initial if-else statement code:

        if "london" in a.lower():
            return "london"
        elif "lond" in a.lower():
            return "london"
        elif "lndn" in a.lower():
            return "london"
        elif "menchester" in a.lower():
            return "manchester"
        elif "hester" in a.lower():
            return "manchester"
        elif "mnhester" in a.lower():
            return "manchester"
        else:
            return "others"

This code is definitely wrong, but I'm not sure how to change it so that I don't have to write a long if-else statement and can instead compare the data against an array/dictionary. Note: the data in the code is just an example; the real data is big.

Upvotes: 1

Views: 193

Answers (2)

Vishnudev Krishnadas

Reputation: 10970

Create a dictionary for your data and use pandas.Series.str.contains to check whether any of a city's alternatives occurs in each value. Use numpy.where to conditionally replace the matching values. Note that the contains method uses regex to search the column.

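import numpy as np

# Map each city to the substrings that identify it
data = {
    'london': ['london','lo','ldn','lnn','lndon','lon','ld','ndn'],
    'manchester': ['hester','ester','mencstr']
}

# df is the DataFrame holding the 'full' column;
# rows matching any alternative are overwritten with the city name
for city, alts in data.items():
    df['full'] = np.where(df.full.str.contains('|'.join(alts)), city, df['full'])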

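Alternatively, a one-liner using replace. The keys have to be treated as regular expressions (regex=True), and wrapping each pattern in .* replaces the whole value rather than only the matched part:

df['full'] = df.full.replace({'.*(' + '|'.join(d) + ').*': c for c, d in data.items()}, regex=True)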

Output

         full
0      london
1  manchester
2       mench
3      london
4      london
5  manchester
6    scotland
7        scot
8     menches
9  manchaster
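
For reference, the `'|'.join(alts)` string passed to `str.contains` is a regular expression, so a value matches if any one alternative occurs anywhere in it. Here is a minimal sketch of that behaviour using only the standard `re` module. The helper name `build_pattern` is made up for illustration, and `re.escape` is an added precaution in case an alternative contains regex metacharacters:

```python
import re

alts = ['hester', 'ester', 'mencstr']

def build_pattern(alternatives):
    # Hypothetical helper: escape each alternative, then join with "|" (regex OR)
    return re.compile('|'.join(re.escape(a) for a in alternatives))

pattern = build_pattern(alts)

# search() succeeds if any alternative appears anywhere in the text
print(bool(pattern.search('menchester united')))  # True
print(bool(pattern.search('scotland')))           # False

# Short alternatives also match inside unrelated words
print(bool(build_pattern(['lo']).search('hello')))  # True
```

Very short alternatives such as 'lo' or 'ld' will therefore match many unrelated values, so they may need to be dropped or lengthened for real data.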

Upvotes: 1

Rahul Sinha

Reputation: 91

import pandas as pd

data = pd.read_excel(r'/c:/Documents/data.xlsx')

# Build a lookup like {"lnn": "london", "ld": "london"} once, outside the function
london_dict = {k: "london" for k in ['london','lo','ldn','lnn','lndon','lon','ld','ndn']}
manchester_dict = {k: "manchester" for k in ['hester','ester','mencstr']}

# Merge all the dicts
city_dict = {**london_dict, **manchester_dict}

def func(a):
    # Exact match on the lowercased value; anything else falls back to "others"
    return city_dict.get(a.lower(), "others")

data["city"] = data["full"].apply(func)
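
Note that the dictionary lookup above matches only when the whole value equals a key, so a value like "lndn is huge" would end up as "others". If substring matching is wanted, as in the question's original if-else chain, one possible sketch is below. The names `CITY_KEYWORDS` and `categorize` are made up for illustration, and the keyword lists are copied from the question's if-else code:

```python
# Hypothetical mapping: city label -> substrings that identify it
CITY_KEYWORDS = {
    "london": ["london", "lond", "lndn"],
    "manchester": ["menchester", "hester", "mnhester"],
}

def categorize(text):
    """Return the first city whose keyword occurs in text, else "others"."""
    lowered = str(text).lower()
    for city, keywords in CITY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return city
    return "others"

print(categorize("lndn is huge"))       # london
print(categorize("Menchester United"))  # manchester
print(categorize("scotland"))           # others
```

With a DataFrame this could be applied as `data["city"] = data["full"].apply(categorize)`. When a value matches keywords from more than one city, the city listed first in the dictionary wins.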

Upvotes: 1
