Reputation: 18800
I originally started with following data frame:
Dataset is related to users answering multiple questions which have multiple answer choices and the user has the ability to answer more than one answer.
movie_id, user_id, rated_value, question_id, answer_id, genre, user_gender, user_ethnicity
101, 345, 3.5, 1, 1, comedy, male, white
101, 345, 3.5, 1, 2, comedy, male, white
101, 345, 3.5, 2, 1, comedy, male, white
125, 345, 4.5, 1, 4, drama, male, white
101, 233, 4.0, 1, 3, comedy, female, black
101, 233, 4.0, 2, 2, comedy, female, black
125, 233, 3.0, 1, 1, drama, female, black
125, 233, 3.0, 2, 2, drama, female, black
125, 333, 3.0, 1, 1, comedy, male, asian
125, 333, 3.0, 2, 2, comedy, male, asian
I want to flatten this table by pivoting. I can successfully do without bringing in genre, user_gender, user_ethnicity
as follows:
pivoted_df = df_to_pivot.assign(val=1).pivot_table(
index=['movie_id',
'user_id',
'rated_value'],
columns=['question_id',
'answer_id'],
values=['question_id', 'answer_id'],
fill_value=0)
then to combine question and answer id so columns will reflect as 1_1, 1_2
pivoted_df.columns = pivoted_df.columns.droplevel()
pivoted_df.columns = ['{}_{}'.format(l1, l2).strip() for l1, l2 in pivoted_df.columns.values]
pivoted_df = pivoted_df.reset_index()
movie_id user_id rating_value 1_1 1_2 1_3 1_4...
but when tried to add genre, user_gender, user_ethnicity
pivoted_df = df_to_pivot.assign(val=1).pivot_table(
index=['movie_id',
'user_id',
'rated_value'],
columns=['question_id',
'answer_id', 'genre', 'user_gender','user_ethnicity'],
values=['question_id', 'answer_id', 'genre', 'user_gender','user_ethnicity'],
fill_value=0)
it doesn't really work.
My goal is to pivot genre, user_gender, user_ethnicity
just like the rest so columns would be
movie_id user_id rated_value 1_1 1_2 1_3 1_4...comedy, drama...,male, female, black, white, asian
output:
movie_id, user_id, rated_value , 1_1, 1_2, 1_3, 1_4, comedy, drama, male, female, white, black, asian
101, 345, 3.5, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0
Goal is to get movie_id, user_id pair per row and everything else reflect by 1 and zero.
Upvotes: 2
Views: 3916
Reputation: 153460
Combine question_id and answer_id in to one column then use pd.get_dummies
df['QandA'] = df['question_id'].astype(str) + '_' + df['answer_id'].astype(str)
pd.get_dummies(df, columns=['QandA','genre','user_gender','user_ethnicity'])
Output:
movie_id user_id rated_value question_id answer_id QandA_1_1 QandA_1_2 QandA_1_3 QandA_1_4 QandA_2_1 QandA_2_2 genre_comedy genre_drama user_gender_female \
0 101 345 3.5 1 1 1 0 0 0 0 0 1 0 0
1 101 345 3.5 1 2 0 1 0 0 0 0 1 0 0
2 101 345 3.5 2 1 0 0 0 0 1 0 1 0 0
3 125 345 4.5 1 4 0 0 0 1 0 0 0 1 0
4 101 233 4.0 1 3 0 0 1 0 0 0 1 0 1
5 101 233 4.0 2 2 0 0 0 0 0 1 1 0 1
6 125 233 3.0 1 1 1 0 0 0 0 0 0 1 1
7 125 233 3.0 2 2 0 0 0 0 0 1 0 1 1
8 125 333 3.0 1 1 1 0 0 0 0 0 1 0 0
9 125 333 3.0 2 2 0 0 0 0 0 1 1 0 0
user_gender_male user_ethnicity_asian user_ethnicity_black user_ethnicity_white
0 1 0 0 1
1 1 0 0 1
2 1 0 0 1
3 1 0 0 1
4 0 0 1 0
5 0 0 1 0
6 0 0 1 0
7 0 0 1 0
8 1 1 0 0
9 1 1 0 0
I think you need pd.get_dummies
:
pd.get_dummies(df, columns=['genre','user_gender','user_ethnicity'])
Output:
movie_id user_id rated_value question_id answer_id genre_comedy genre_drama user_gender_female user_gender_male user_ethnicity_asian user_ethnicity_black \
0 101 345 3.5 1 1 1 0 0 1 0 0
1 101 345 3.5 1 2 1 0 0 1 0 0
2 101 345 3.5 2 1 1 0 0 1 0 0
3 125 345 4.5 1 4 0 1 0 1 0 0
4 101 233 4.0 1 3 1 0 1 0 0 1
5 101 233 4.0 2 2 1 0 1 0 0 1
6 125 233 3.0 1 1 0 1 1 0 0 1
7 125 233 3.0 2 2 0 1 1 0 0 1
8 125 333 3.0 1 1 1 0 0 1 1 0
9 125 333 3.0 2 2 1 0 0 1 1 0
user_ethnicity_white
0 1
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
Upvotes: 1