Phill Donn

Reputation: 180

Pandas realization of leave one out encoding for categorical features

I recently watched a video by Owen Zhang, a Kaggle rank-1 competitor (https://youtu.be/LgLcfZjNF44), where he explains a technique for encoding categorical features as numerical ones called leave-one-out encoding. For each observation of a categorical feature, he associates a value equal to the average of the response over all other observations with the same category.
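To make the definition concrete, here is a minimal sketch of leave-one-out encoding on a toy DataFrame (the column names `category` and `target` are made up for illustration): each row receives the mean target of the other rows in its category.

```python
import pandas as pd

# Toy data: rows 0-2 share category 'a', rows 3-4 share category 'b'.
df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b'],
    'target':   [1.0, 2.0, 3.0, 10.0, 20.0],
})

# Naive, direct implementation of the definition: for each element,
# drop it from its group and average what remains.
df['loo'] = df.groupby('category')['target'].transform(
    lambda s: [s.drop(i).mean() for i in s.index]
)

print(df['loo'].tolist())  # [2.5, 2.0, 1.5, 20.0, 10.0]
```

For example, row 0 gets (2.0 + 3.0) / 2 = 2.5, the mean of the other two 'a' rows.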

I've been trying to implement this strategy in Python using pandas. I've managed to write working code, but since my data set has tens of millions of rows, it is very slow. If someone could suggest a faster solution I'd be very grateful.

This is my code so far:

import numpy as np
import pandas as pd

def categ2numeric(data, train=True):
    def f(series):
        # For each element, average all the OTHER elements in the group.
        indexes = series.index.values
        pomseries = pd.Series(dtype=float)
        for i, index in enumerate(indexes):
            pom = np.delete(indexes, i)
            pomseries.loc[index] = series[pom].mean()
        return pomseries

    if train:
        return data.groupby(by=['Cliente_ID'])['Demanda_uni_equil'].apply(f)

And I need to turn this Series:

            159812     28.0
            464556     83.0
            717223     45.0
            1043801    21.0
            1152917     7.0
            Name: 26, dtype: float32

to this:

            159812     39.00
            464556     25.25
            717223     34.75
            1043801    40.75
            1152917    44.25
            dtype: float64

Mathematically, the element with index 159812 equals the average of all the other elements:

39 = (83 + 45 + 21 + 7) / 4

Upvotes: 2

Views: 3533

Answers (3)

neil armstrong

Reputation: 11

There's a library, category_encoders, whose API follows scikit-learn conventions.

So, you can use something like:

from category_encoders import LeaveOneOutEncoder

encoder = LeaveOneOutEncoder(cols=['Cliente_ID'])
encoder.fit(X, y)
X_encoded = encoder.transform(X)

Upvotes: 1

Phill Donn

Reputation: 180

With help from @root I have found that the fastest solution to this problem is this kind of approach:

# Per-group sums and per-group counts.
cs = train.groupby(by=['Cliente_ID'])['Demanda_uni_equil'].sum()
cc = train['Cliente_ID'].value_counts()

# Groups of size 1 would give 0/0 below; bump their count to 2 and
# double their sum so the result falls back to the value itself.
boolean = (cc == 1)
index = boolean[boolean].index.values
cc.loc[boolean] += 1
cs.loc[index] *= 2

# Broadcast the group sum and count onto each row, then compute the
# leave-one-out mean: (group sum - own value) / (group size - 1).
train = train.join(cs.rename('sum'), on=['Cliente_ID'])
train = train.join(cc.rename('count'), on=['Cliente_ID'])
train['Cliente_IDloo'] = (train['sum'] - train['Demanda_uni_equil']) / (train['count'] - 1)
del train['sum'], train['count']

On my data, the apply approach with a callable takes about 2 minutes, while this approach takes only about 1 second. It is a bit cumbersome, though.

Upvotes: 1

root

Reputation: 33793

Replace each element of the Series with difference between the sum of the Series and the element, then divide by the length of the series minus 1. Assuming s is your Series:

s = (s.sum() - s)/(len(s) - 1)

The resulting output:

159812     39.00
464556     25.25
717223     34.75
1043801    40.75
1152917    44.25
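Applied per group, this formula can also be passed to `groupby(...).transform`, which gives the full leave-one-out column in one call. A minimal sketch, reusing the question's column names on the five-row example group:

```python
import pandas as pd

# The example group from the question (Cliente_ID 26).
train = pd.DataFrame({
    'Cliente_ID':        [26, 26, 26, 26, 26],
    'Demanda_uni_equil': [28.0, 83.0, 45.0, 21.0, 7.0],
})

# For each row: (group sum - own value) / (group size - 1).
train['loo'] = train.groupby('Cliente_ID')['Demanda_uni_equil'].transform(
    lambda s: (s.sum() - s) / (len(s) - 1)
)

print(train['loo'].tolist())  # [39.0, 25.25, 34.75, 40.75, 44.25]
```

Note that groups of size 1 divide by zero here, producing NaN/inf; the asker's accepted approach above handles that case explicitly.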

Upvotes: 4
