Reputation: 3482
I have a really simple dataframe for testing purposes. It looks like this:
movieId | title | genres | Drama | Action | Comedy
1 | Toy Story | {'Drama', 'Comedy'} | 0 | 0 | 0
I want to reflect the genres set as booleans in the respective columns, so the desired result would be:
movieId | title | genres | Drama | Action | Comedy
1 | Toy Story | {'Drama', 'Comedy'} | 1 | 0 | 1
So I tried this code with apply:
def ttb(genreset):
    return tuple(1 if g in genreset else 0 for g in all_genres)

all_genres = ('Drama', 'Action', 'Comedy')
df.T.loc[all_genres, :] = df.apply(lambda x: ttb(x.loc['genres']), axis=1)
But this resulted in an error that I can't really wrap my head around:
ValueError: shape mismatch: value array of shape (19,) could not be broadcast to indexing result of shape (19,1)
Do I need to somehow cast the return value of apply to have a fixed size, or why doesn't it work as I would expect? I tried with more data as well, but always got the same error. Googling for the error gave lots of results, but none offered a viable solution for me.
Upvotes: 1
Views: 68
Reputation: 402553
Call str.join followed by str.get_dummies:
v = df.genres.str.join(',').str.get_dummies(sep=',')
Or, if "Action" needs to be added explicitly, let's use reindex:
v = (df['genres']
       .str.join(',')
       .str.get_dummies(sep=',')
       .reindex(
           ['Comedy', 'Action', 'Drama'],
           axis=1,
           fill_value=0
       )
)
print(v)
   Comedy  Action  Drama
0       1       0      1
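For reference, here is a minimal self-contained sketch of the two steps above. The one-row frame is a hypothetical stand-in modelled on the sample in the question, and the column order passed to reindex is just the order the question's table uses.

```python
import pandas as pd

# Hypothetical one-row frame modelled on the question's sample data.
df = pd.DataFrame({
    "movieId": [1],
    "title": ["Toy Story"],
    "genres": [{"Drama", "Comedy"}],
})

# str.join turns each row's set into one comma-separated string;
# str.get_dummies then splits on the comma and emits one 0/1 column per label.
v = df["genres"].str.join(",").str.get_dummies(sep=",")

# reindex adds the missing "Action" column (filled with 0) and fixes the order.
v = v.reindex(["Drama", "Action", "Comedy"], axis=1, fill_value=0)
print(v)
```

Because get_dummies only creates columns for labels that actually occur in the data, the reindex step is what guarantees a column for a genre no movie has.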
If you have many unique values and you're not sure what they are, you can always find their union:
u = set().union(*df.genres.tolist())
And now, use u to reindex the result.
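A rough sketch of that union step, assuming a small made-up frame with three rows. Sorting u before passing it to reindex is my own addition: a set has no defined order, so sorting keeps the column order stable between runs.

```python
import pandas as pd

# Hypothetical sample data: each row holds a set of genre labels.
df = pd.DataFrame({"genres": [{"Drama", "Comedy"}, {"Action"}, {"Comedy"}]})

# Union of every row's set gives all labels present in the column.
u = set().union(*df["genres"].tolist())

# Assumption: sort the labels so the resulting column order is deterministic.
columns = sorted(u)

v = (df["genres"]
       .str.join(",")
       .str.get_dummies(sep=",")
       .reindex(columns, axis=1, fill_value=0))
print(v)
```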
If you need to add this back to your original DataFrame, use concat:
df = pd.concat([df, v], axis=1)
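One caveat, sketched below under the assumption that your frame already contains the zero-filled Drama, Action and Comedy columns from the question: concatenating v as-is would leave you with two columns of each name. Dropping the placeholders first is one way around that; the names all_genres and out are only illustrative.

```python
import pandas as pd

# Hypothetical frame that already has the zero-filled placeholder columns.
df = pd.DataFrame({
    "movieId": [1],
    "title": ["Toy Story"],
    "genres": [{"Drama", "Comedy"}],
    "Drama": [0],
    "Action": [0],
    "Comedy": [0],
})
all_genres = ["Drama", "Action", "Comedy"]

v = (df["genres"]
       .str.join(",")
       .str.get_dummies(sep=",")
       .reindex(all_genres, axis=1, fill_value=0))

# Drop the placeholders so concat does not produce duplicate column names.
out = pd.concat([df.drop(columns=all_genres), v], axis=1)
print(out)
```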
Upvotes: 3