Reputation: 31
I have a dataframe with tag as column and movieId as index. movieId represents movies. So there may be multiple tags for the same movieId.
I want to transform this dataframe and there are as many columns as there are tags and that moviesId become the lines but there must be one line per movieId. And that for each movieId if there was a tag there is a 1 in the tag column at otherwise 0
I can't do it when I try, the moviesID appear several times.
Would someone help me
Thank you so much.
Upvotes: 1
Views: 90
Reputation: 663
We can use ".crosstab()" function to get the required output:
I have created a sample dataframe as 'df':
movieId tag
260 Best movie ever
1240 scifi
2133 Best movie ever
1097 scifi
260 scifi
250 scifi
By using .crosstab() function:
pd.crosstab(df.movieId, df.tag, dropna = False)
The output will be like this:
tag Bestmovie ever scifi
movieId
250 0 1
260 1 1
1097 0 1
1240 0 1
2133 1 0
Upvotes: 3
Reputation: 674
i hope this fixes the problem
import pandas as pd
import numpy as np
df = pd.DataFrame([[260, "best"],[520,"sci"],[260,"sci"]],columns=['movieId','tag'])
print("Dummy DataFrame: \n", df)
movieId, tags= list(df['movieId'].unique()), list(df['tag'].unique())
dfmatrix= pd.DataFrame(np.zeros((len(movieId),len(tags)+1),dtype=int), columns=['movieID']+tags)
# dfmatrix['movieID'][1]= 54
for i, movie in enumerate(movieId):
listoftag = df.tag[df['movieId']==movie]
dfmatrix.movieID[i]= movie
for tag in listoftag:
dfmatrix[tag][i]=1
print("\n \n dfmatrix \n",dfmatrix)
the output is :
Dummy DataFrame:
movieId tag
0 260 best
1 520 sci
2 260 sci
dfmatrix
movieID best sci
0 260 1 1
1 520 0 1
Upvotes: 1