user4505419

Reputation: 341

Pandas create dummy features for each string in a dictionary of lists

I am implementing the following logic for feature engineering. A simple approach is easy, but I am wondering whether there is a more efficient solution. Ideas are appreciated even if you don't feel like implementing the whole thing!

Take this DataFrame and dictionary:

import pandas as pd
random_animals = pd.DataFrame(
                {'description':['xdogx','xcatx','xhamsterx','xdogx'
                                ,'xhorsex','xdonkeyx','xcatx']
                })


cat_dict = {'category_a':['dog','cat']
            ,'category_b':['horse','donkey']}

We want to create a column/feature for each string in the dictionary AND for each category: 1 if the string is contained in the description column, 0 otherwise. A category column is 1 if any of that category's strings is contained in the description.

So the output for this toy example would look like:

  description  is_dog  is_cat  is_horse  is_donkey  is_category_a  is_category_b
0       xdogx       1       0         0          0              1              0
1       xcatx       0       1         0          0              1              0
2   xhamsterx       0       0         0          0              0              0
3       xdogx       1       0         0          0              1              0
4     xhorsex       0       0         1          0              0              1
5    xdonkeyx       0       0         0          1              0              1
6       xcatx       0       1         0          0              1              0

The simple approach would be to iterate once per required output column and run the following (is_dog is hardcoded here for simplicity):

random_animals['is_dog'] = random_animals['description'].str.contains('dog')*1
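To make the simple approach concrete, here is a minimal sketch of that loop driven by cat_dict. The use of regex=False for single strings and re.escape for the joined category pattern are my own assumptions, added so that strings containing regex metacharacters are matched literally.

```python
import re

import pandas as pd

random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']})

cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}

description = random_animals['description']

# One column per individual string (plain substring match, no regex).
for strings in cat_dict.values():
    for s in strings:
        random_animals['is_' + s] = description.str.contains(s, regex=False) * 1

# One column per category: 1 if any of the category's strings matches.
for category, strings in cat_dict.items():
    pattern = '|'.join(re.escape(s) for s in strings)
    random_animals['is_' + category] = description.str.contains(pattern) * 1
```

This scans the description column once per output column, which is the cost I would like to avoid.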

There can be an arbitrary number of strings and categories in cat_dict, so I am wondering whether there is a better way to do this.

Upvotes: 0

Views: 387

Answers (3)

hilberts_drinking_problem

Reputation: 11602

Here is a vectorized method. The main observation is that random_animals.description.str.contains, when applied to a string, returns a Series of indicators, one for each row of random_animals.

Since random_animals.description.str.contains is itself vectorized over the rows, we can apply it to each animal in the collection to obtain a full indicator matrix.

Finally, we can add the category columns by combining the animal columns with a logical OR. This will likely be faster than checking for string inclusion a second time.

import pandas as pd
random_animals = pd.DataFrame(
                {'description':['xdogx','xcatx','xhamsterx','xdogx'
                                ,'xhorsex','xdonkeyx','xcatx']
                })


cat_dict = {'category_a':['dog', 'cat']
            ,'category_b':['horse', 'donkey']}

# create a Series containing all individual animals (without duplicates)
animals = pd.Series(list(dict.fromkeys(
        animal for v in cat_dict.values() for animal in v)))

# one indicator column per animal, one row per description
df = pd.DataFrame(
        animals.apply(random_animals.description.str.contains).T.values,
        index  = random_animals.description,
        columns = animals).astype(int)

# a category is 1 if any of its animals is present
for cat, members in cat_dict.items():
    df[cat] = df[members].any(axis=1).astype(int)

#              dog  cat  horse  donkey  category_a  category_b
# description
# xdogx          1    0      0       0           1           0
# xcatx          0    1      0       0           1           0
# xhamsterx      0    0      0       0           0           0
# xdogx          1    0      0       0           1           0
# xhorsex        0    0      1       0           0           1
# xdonkeyx       0    0      0       1           0           1
# xcatx          0    1      0       0           1           0

Upvotes: 2

Antonio Luis Sombra

Reputation: 126

Interesting problem. I coded what you want below, but there's probably a shorter way to do it:

import numpy as np
import pandas as pd

# Creating the DataFrame with columns of zeros
# (names are taken from cat_dict, so 'hamster' gets no column)
names = [name for group in cat_dict.values() for name in group]
categories = list(cat_dict.keys())
columns = names + categories
df_names = pd.DataFrame(0, index=np.arange(len(random_animals)),
                        columns=columns)
df = pd.concat([random_animals, df_names], axis=1)

# Populating the DataFrame - automating your solution
for column in columns:
    if column in cat_dict:
        # For categories: match any of the category's strings
        pattern = '|'.join(cat_dict[column])
    else:
        # For animal names: match the name itself
        pattern = column
    df[column] = df['description'].str.contains(pattern) * 1

# Finally, renaming the columns from "dog" to "is_dog", and so on
df = df.rename(columns={column: 'is_' + column for column in columns})

Upvotes: 2

SKG

Reputation: 1462

You could extend the pandas DataFrame class and implement lazy column evaluation: if a derived column does not exist when it is requested, compute it and add it to the frame's columns.
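A rough sketch of that idea, under several assumptions: the class name LazyDummyFrame, the cat_dict attribute, and the 'is_' naming rule are all hypothetical, and the source column is assumed to be called description. It relies on pandas' documented subclassing hooks (_constructor and _metadata) and overrides __getitem__ so that a missing 'is_...' column is computed on first access.

```python
import pandas as pd


class LazyDummyFrame(pd.DataFrame):
    """Hypothetical subclass that builds 'is_<name>' columns on first access."""

    # Tell pandas that cat_dict is a plain attribute, not a column.
    _metadata = ["cat_dict"]

    @property
    def _constructor(self):
        # Keep the subclass when pandas creates new frames from this one.
        return LazyDummyFrame

    def __getitem__(self, key):
        if (isinstance(key, str) and key.startswith("is_")
                and key not in self.columns):
            name = key[len("is_"):]
            cat_dict = getattr(self, "cat_dict", None) or {}
            # A category expands to its strings; anything else is one string.
            terms = cat_dict.get(name, [name])
            description = super().__getitem__("description")
            hits = pd.Series(False, index=self.index)
            for term in terms:
                hits |= description.str.contains(term, regex=False)
            # Store the result so later accesses reuse the computed column.
            self[key] = hits.astype(int)
        return super().__getitem__(key)


frame = LazyDummyFrame({"description": ["xdogx", "xcatx", "xhamsterx", "xhorsex"]})
frame.cat_dict = {"category_a": ["dog", "cat"], "category_b": ["horse", "donkey"]}

is_dog = frame["is_dog"]            # computed and stored on first access
is_cat_b = frame["is_category_b"]   # any of 'horse' or 'donkey'
```

Only the columns that are actually requested get computed, which may help when cat_dict is large and only a few features are used downstream.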

Upvotes: 0
