Reputation: 1500
I have a column with values listed as a string, delimited with '; ' and I would like to generate true/false columns for each separate value. The dataframe is large (100k rows, 13 values) and my current code does around 1.5 rows per second in that dataframe. Is there a dedicated pandas function that would do this much faster? Or is there an obvious mistake I'm making that slows my code down impossibly?
Data looks like:
id likes colour
0 1 blue
1 2 blue; green; red
2 3 red; purple
3 4 purple; blue
4 5 yellow; red
5 6 yellow
Output is:
id likes colour red yellow blue purple green
0 1 blue False False True False False
1 2 blue; green; red True False True False True
2 3 red; purple True False False True False
3 4 purple; blue False False True True False
4 5 yellow; red True True False False False
5 6 yellow False True False False False
Current function to go from first to second frame:
def do_thing(data, col_name = 'likes colour'):
cols = list(set([j for i in list(data[col_name]) for j in i.split('; ')]))
for i in cols:
data[i] = None
for i in data.index:
for j in cols:
if j in data.loc[i, col_name]:
data.loc[i, j] = True
else:
data.loc[i, j] = False
return data
Basically, this first identifies each different value, then creates empty columns for each and finally fills these with True or False there or not a given word is there.
It's fine for the sample dataframe, but once I add a few more colors and multiply by 10000 it goes at a snail's pace. Help is much appreciated!
Upvotes: 0
Views: 1127
Reputation: 18647
Try using Series.str.get_dummies
and DataFrame.join
:
# Setup
df = pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6}, 'likes colour': {0: 'blue', 1: 'blue; green; red', 2: 'red; purple', 3: 'purple; blue', 4: 'yellow; red', 5: 'yellow'}})
df.join(df['likes colour'].str.get_dummies(sep='; ').astype(bool))
[out]
id likes colour blue green purple red yellow
0 1 blue True False False False False
1 2 blue; green; red True True False True False
2 3 red; purple False False True True False
3 4 purple; blue True False True False False
4 5 yellow; red False False False True True
5 6 yellow False False False False True
Upvotes: 9