How to encode multiple categorical columns for test data efficiently?

Question

I have multiple categorical columns (nearly 50). I fit a custom frequency encoding on the training data and save it as a nested dictionary. For the test data I use the map function to encode, and unseen labels are replaced with 0. Is there a more efficient way?

I have already tried the pandas replace method, but it does not handle unseen labels and leaves them as they are. I am mainly concerned about time: I want, say, 80 columns and 1 row to be encoded within 60 ms, so I need the most efficient way to do this. I have taken my example from here.

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'], 
                       'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
                       'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
             'New_York']})

My dict looks something like this :

enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
       'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
       'location': {'New_York': 0, 'San_Diego': 1}}

for col in enc:
    if col in input_df.columns:
        input_df[col] = input_df[col].map(enc[col]).fillna(0)  # unseen labels become 0

I also want multiple columns to be encoded at once, without a loop over every column. I guess that cannot be done with map. So replace looks like a good choice, but as said, it does not handle unseen labels.

EDIT:

This is the code I am using for now. Please note that there is only 1 row in the test data frame (I am not sure whether I should handle it as a numpy array to reduce time). I need to bring this time down to under 60 ms. I only have a dictionary for the mapping (I can't use one-hot encoding because of my use case). The current time is 331.74 ms. Any idea how to do this more efficiently? I am not sure multiprocessing would help. With the replace method I ran into several issues: 1. It does not handle unseen labels and leaves them as they are (a problem for strings). 2. It has problems when keys and values overlap.

from string import ascii_lowercase
import itertools
import pandas as pd
import numpy as np
import time

def iter_all_strings():
    for size in itertools.count(1):
       for s in itertools.product(ascii_lowercase, repeat=size):
           yield "".join(s)


l = []
for s in iter_all_strings():
    l.append(s)
    if s == 'gr':
        break

columns = l
df = pd.DataFrame(columns=columns)
for col in df.columns:
    df[col] = np.random.randint(1, 4000, 3000)

transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
print(f"The length of the dictionary is {len(transform_dict)}")


# Creating another test data frame
df2 = pd.DataFrame(columns=columns)
for col in df2.columns:
    df2[col] = np.random.randint(1, 4000, 1)
print(f"The shape of the 2nd data frame is {df2.shape}")

t1 = time.time()

for col in df2.columns:
    df2[col] = df2[col].map(transform_dict[col]).fillna(0)
print(f"Time taken is {time.time() - t1}")
# print(df)

Venkatachalam · Accepted Answer

Firstly, when you want to encode a categorical variable that is not ordinal (meaning there is no inherent ordering between the values of the variable/column, e.g. cat, dog), you must use one-hot encoding.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder 

df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'], 
                   'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
                   'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
             'New_York']})

enc = [['cat','dog','monkey'],
       ['Brick', 'Champ', 'Ron', 'Veronica'],
       ['New_York', 'San_Diego']]
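# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False.
ohe = OneHotEncoder(categories=enc, handle_unknown='ignore', sparse=False)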

Here, I have modified your enc in a way that can be fed into the OneHotEncoder.

Now comes the question of how we are going to handle the unseen labels.

When you set handle_unknown='ignore', the unseen values will have zeros in all the dummy variables, which in a way helps the model understand that it is an unknown value.

encoded = ohe.fit_transform(df)  # fit first: ohe.categories_ exists only after fitting
colnames = ['{}_{}'.format(col, val) for col, unique_values in zip(df.columns, ohe.categories_)
                                     for val in unique_values]
pd.DataFrame(encoded, columns=colnames)
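
To make the all-zero behaviour for unseen labels concrete, here is a minimal pure-Python sketch. The helper `one_hot_row` is hypothetical and written only for illustration; it mimics the effect of handle_unknown='ignore' and is not how OneHotEncoder is implemented.

```python
def one_hot_row(row, categories):
    """One-hot encode one row; a label missing from `categories` gives all zeros."""
    encoded = []
    for value, known in zip(row, categories):
        # One dummy per known category; no dummy fires for an unseen label.
        encoded.extend(1 if value == cat else 0 for cat in known)
    return encoded


categories = [['cat', 'dog', 'monkey'],
              ['Brick', 'Champ', 'Ron', 'Veronica'],
              ['New_York', 'San_Diego']]

seen = one_hot_row(['dog', 'Ron', 'New_York'], categories)
unseen = one_hot_row(['meo', 'Ron', 'New_York'], categories)

print(seen)    # [0, 1, 0, 0, 0, 1, 0, 1, 0]
print(unseen)  # [0, 0, 0, 0, 0, 1, 0, 1, 0]  ('meo' -> all three pet dummies are 0)
```

The first three positions are the pets dummies, so the unseen label 'meo' is the only difference between the two rows.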

Update:

If you are fine with ordinal encoding, the following change could help.


df2.apply(lambda row: [transform_dict[col].get(val, 0)
                       for col, val in row.items()],
          axis=1,
          result_type='expand')

#1000 loops, best of 3: 1.17 ms per loop
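
Since the test data is a single row, another option may be to skip pandas for the lookup and work on a plain dict. This is only a sketch under that assumption: `encode_row` is a hypothetical helper, and the record is assumed to arrive as a column-to-value mapping. I have not timed it against the 60 ms target.

```python
def encode_row(row, transform_dict, default=0):
    """Encode one record (column -> raw value) using the saved nested dict.

    Labels that are missing from a column's mapping fall back to `default`.
    Columns without a mapping are dropped from the result.
    """
    return {col: transform_dict[col].get(val, default)
            for col, val in row.items()
            if col in transform_dict}


enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
       'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
       'location': {'New_York': 0, 'San_Diego': 1}}

row = {'pets': 'meo', 'owner': 'Ron', 'location': 'San_Diego'}
encoded = encode_row(row, enc)
print(encoded)  # {'pets': 0, 'owner': 2, 'location': 1}
```

If the record starts out as a one-row data frame, `df2.iloc[0].to_dict()` gives such a mapping. Note that with default=0 an unseen label gets the same code as the first category ('cat' in enc above); passing default=-1 keeps the two apart if the model can handle it.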
