Reputation: 341
I am implementing the following logic for feature engineering. A simple approach is easy, but I am wondering whether there is a more efficient solution. Ideas are appreciated even if you don't feel like writing the whole code!
Take this DataFrame and dictionary:
import pandas as pd
random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']})
cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}
We want to create a column/feature for each string in the dictionary AND for each category. The value is 1 if the string (or, for a category, any of its strings) is contained in the description column, and 0 otherwise.
So the output for this toy example would look like:
  description  is_dog  is_cat  is_horse  is_donkey  is_category_a  is_category_b
0       xdogx       1       0         0          0              1              0
1       xcatx       0       1         0          0              1              0
2   xhamsterx       0       0         0          0              0              0
3       xdogx       1       0         0          0              1              0
4     xhorsex       0       0         1          0              0              1
5    xdonkeyx       0       0         0          1              0              1
6       xcatx       0       1         0          0              1              0
The simple approach would be to loop once per required output column and run the following for each one (hardcoded to is_dog here for simplicity):
random_animals['is_dog'] = random_animals['description'].str.contains('dog')*1
cat_dict can contain an arbitrary number of strings and categories, so I am wondering whether there is a better way to do this.
Upvotes: 0
Views: 387
Reputation: 11602
Here is a vectorized method. The main observation is that random_animals.description.str.contains, when applied to a string, returns a Series of indicators, one for each row of random_animals.
Since random_animals.description.str.contains is itself a vectorized function, we can apply it to the collection of animals to obtain a full indicator matrix.
Finally, we can add the categories by combining the relevant animal columns with a logical OR. This will likely be faster than checking for string inclusion multiple times.
import pandas as pd
random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']})
cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}
# create a Series containing all individual animals
# (assumes no animal is listed under more than one category)
animals = pd.Series([animal for v in cat_dict.values()
                     for animal in v])
# one row per description, one 0/1 column per animal
df = pd.DataFrame(
    animals.apply(random_animals.description.str.contains).T.values,
    index=random_animals.description,
    columns=animals).astype(int)
# a category is 1 when any of its animals matched
for cat, members in cat_dict.items():
    df[cat] = df[members].any(axis=1).astype(int)
#              dog  cat  horse  donkey  category_a  category_b
# description
# xdogx          1    0      0       0           1           0
# xcatx          0    1      0       0           1           0
# xhamsterx      0    0      0       0           0           0
# xdogx          1    0      0       0           1           0
# xhorsex        0    0      1       0           0           1
# xdonkeyx       0    0      0       1           0           1
# xcatx          0    1      0       0           1           0
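The output above uses bare column names and the description as the index. A minimal sketch of a variation that follows the same idea (build the per-string indicator matrix once, then derive the categories from it) but produces the is_ prefixed columns asked for in the question. The names words, indicators, categories and result are chosen here for illustration, and regex=False assumes the dictionary entries are plain substrings rather than patterns.

```python
import pandas as pd

random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']})
cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}

# Flatten the dictionary into one list of search strings.
words = [w for ws in cat_dict.values() for w in ws]

# One substring check per string; regex=False treats each entry literally.
indicators = pd.DataFrame(
    {f'is_{w}': random_animals['description'].str.contains(w, regex=False)
     for w in words}).astype(int)

# Categories reuse the indicator columns instead of searching the text again.
categories = pd.DataFrame(
    {f'is_{c}': indicators[[f'is_{w}' for w in ws]].any(axis=1).astype(int)
     for c, ws in cat_dict.items()})

result = pd.concat([random_animals, indicators, categories], axis=1)
print(result)
```

Each string is searched for exactly once, so the number of passes over the description column grows with the number of strings, not with the number of output columns.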
Upvotes: 2
Reputation: 126
Interesting problem. I coded what you want below, but there's probably a shorter way to do it:
# Assumes random_animals and cat_dict are defined as in the question
import numpy as np
import pandas as pd
# Creating the DataFrame with columns of zeros
# (animal names are taken from the descriptions by stripping the surrounding 'x')
names = [x[1:-1] for x in random_animals.description.unique()]
categories = list(cat_dict.keys())
columns = names + categories
df_names = pd.DataFrame(0, index=np.arange(len(random_animals)),
                        columns=columns)
df = pd.concat([random_animals, df_names], axis=1)
# Populating the DataFrame - automating your solution
for i in range(len(df.columns) - 1):
    # For animal names
    df[df.columns[i+1]] = df['description'].str.contains(df.columns[i+1]) * 1
    # For categories
    if df.columns[i+1] in list(cat_dict.keys()):
        searchfor = cat_dict[df.columns[i+1]]
        df[df.columns[i+1]] = df['description'].str.contains('|'.join(searchfor)) * 1
# Finally, renaming the animal columns from "dog" to "is_dog", and so on
for column in df.columns:
    if column in names:
        column_new = "is_" + column
        df[column_new] = df[column]
        df = df.drop(column, axis=1)
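One caveat with '|'.join(searchfor): str.contains treats the joined string as a regular expression by default, so any search string containing regex metacharacters would be interpreted as a pattern. A minimal sketch of escaping each term first with re.escape follows. The descriptions and the 'a.b' search term are hypothetical data made up for illustration and are not part of the question.

```python
import re
import pandas as pd

# Hypothetical data: the second row only matches if '.' is read as a wildcard.
descriptions = pd.Series(['xa.bx', 'xaxbx'])
searchfor = ['a.b', 'dog']

# Unescaped: '.' matches any character, so both rows are flagged.
unescaped = descriptions.str.contains('|'.join(searchfor)).astype(int)

# Escaped: every term is matched literally.
pattern = '|'.join(re.escape(term) for term in searchfor)
escaped = descriptions.str.contains(pattern).astype(int)

print(unescaped.tolist(), escaped.tolist())
```

For the animal names in the question this makes no difference, since none of them contain special characters.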
Upvotes: 2
Reputation: 1462
You could extend the pandas DataFrame class and implement lazy column evaluation: if a derived column does not exist when it is requested, compute it at that point and add it to the DataFrame's columns.
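A rough sketch of what that might look like, under several assumptions: the class name LazyIndicatorFrame, the cat_dict keyword argument, and the is_ naming rule are all invented here for illustration, and the text column is assumed to be called description. It relies on the _metadata and _constructor hooks that pandas documents for subclassing; treat it as a starting point and not a hardened implementation.

```python
import pandas as pd


class LazyIndicatorFrame(pd.DataFrame):
    """Hypothetical subclass that builds is_<name> columns on first access."""

    # Attributes named here are stored normally and propagated by pandas.
    _metadata = ['cat_dict']

    def __init__(self, *args, cat_dict=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.cat_dict = cat_dict or {}

    @property
    def _constructor(self):
        # Keep derived frames as LazyIndicatorFrame instead of plain DataFrame.
        return LazyIndicatorFrame

    def __getitem__(self, key):
        if isinstance(key, str) and key.startswith('is_') and key not in self.columns:
            name = key[len('is_'):]
            strings = {s for members in self.cat_dict.values() for s in members}
            if name in self.cat_dict:
                terms = self.cat_dict[name]   # category: any member may match
            elif name in strings:
                terms = [name]                # single known string
            else:
                terms = None                  # unknown name: fall through to KeyError
            if terms is not None:
                text = super().__getitem__('description')
                hit = pd.Series(False, index=self.index)
                for term in terms:
                    hit = hit | text.str.contains(term, regex=False)
                self[key] = hit.astype(int)   # cache the derived column
        return super().__getitem__(key)


cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}
frame = LazyIndicatorFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']},
    cat_dict=cat_dict)

print(list(frame.columns))   # only 'description' so far
print(frame['is_dog'].tolist())
print(list(frame.columns))   # 'is_dog' has now been added
```

The trade-off is that columns are only computed when something asks for them, at the cost of hiding a mutation inside a read operation.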
Upvotes: 0