Reputation: 77
I have a question about merging multiple dataframes and adding a column of dummies.
Now I have two original input dataframes. The first dataframe is answering the question "which color is your favorite?" The second dataframe is answering the question "On a scale from 1 to 7, to what extent do you dislike this color?"
df1 = pd.DataFrame({'id': ['01','02'],
'like_wave_1': ['red','red'],
'like_wave_2': ['red','yellow']})
print(df1)
df2 = pd.DataFrame({'id': ['01','02'],
'dislike_wave1_yellow': ['7','2'],
'dislike_wave1_red':['1','1'],
'dislike_wave1_blue':['2','7'],
'dislike_wave2_yellow': ['7','1'],
'dislike_wave2_red':['1','2'],
'dislike_wave2_blue':['3','7']})
print(df2)
The following dataframe constructs the outline of my expected output dataframe.
list_id = ['01','02']
list_color = ['yellow','red','blue']
list_wave = ['1','2']
expand = list(product(list_id, list_color, list_wave))
df = pd.DataFrame.from_records(expand, columns=['id', 'color', 'wave'])
print(df)
id color wave
0 01 yellow 1
1 01 yellow 2
2 01 red 1
3 01 red 2
4 01 blue 1
5 01 blue 2
6 02 yellow 1
7 02 yellow 2
8 02 red 1
9 02 red 2
10 02 blue 1
11 02 blue 2
I would like to add two columns to df:
(1) "like": a column to show the presence of whether a color has shown for a specific id in a specific wave (1 represents yes, and 0 represents no)
(2) "dislike".
Therefore my expected dataframe is:
id color wave like dislike
0 01 yellow 1 0 7
1 01 yellow 2 0 7
2 01 red 1 1 1
3 01 red 2 1 1
4 01 blue 1 0 2
5 01 blue 2 0 3
6 02 yellow 1 0 2
7 02 yellow 2 1 1
8 02 red 1 1 1
9 02 red 2 0 2
10 02 blue 1 0 7
11 02 blue 2 0 7
Could you please help me solve this problem? Thank you so much for answering!
Upvotes: 1
Views: 660
Reputation: 35676
Try to get both frames into a compatible format with the other:
DF1
# Get df1 into usable format
df1 = df1.set_index('id')
# Create Multi Index by splitting columns on '_'
df1.columns = df1.columns.str.split('_', expand=True)
# Stack to create long format frame
df1 = df1.stack().reset_index()
# Fix column names to match df2/output
df1.columns = ['id', 'wave', 'color']
# Set like to 1 for these since this table indicates likes
df1['like'] = 1
df1
:
id wave color like
0 01 1 red 1
1 01 2 red 1
2 02 1 red 1
3 02 2 yellow 1
DF2
# Get df2 into usable format
# Set index to ID
df2 = df2.set_index('id')
# Create Multi Index by splitting columns on '_'
df2.columns = df2.columns.str.split('_', expand=True)
# Stack to create long format frame
df2 = df2.stack(level=[1, 2]).reset_index()
# Fix column names to match df1
df2.columns = ['id', 'wave', 'color', 'dislike']
# Turn "wave1" into 1, "wave2" into 2, ... etc.
df2['wave'] = df2['wave'].str.lstrip('wave')
df2
:
id wave color dislike
0 01 1 blue 2
1 01 1 red 1
2 01 1 yellow 7
3 01 2 blue 3
4 01 2 red 1
5 01 2 yellow 7
6 02 1 blue 7
7 02 1 red 1
8 02 1 yellow 2
9 02 2 blue 7
10 02 2 red 2
11 02 2 yellow 1
Then merge
the frames together:
# Merge On Common Columns
df3 = df1.merge(df2, on=['id', 'wave', 'color'], how='outer')
# Fill empty values in like and dislike with 0 (only 1s in source DF1)
# (Fill dislikes in case there are likes in df1 that are not dislikes in df2)
df3[['like', 'dislike']] = df3[['like', 'dislike']].fillna(0).astype(int)
# Sort Values and fix index (to match output in question)
df3 = df3.sort_values(
['id', 'color'], ascending=[True, False]
).reset_index(drop=True)
df3
:
id wave color like dislike
0 01 1 yellow 0 7
1 01 2 yellow 0 7
2 01 1 red 1 1
3 01 2 red 1 1
4 01 1 blue 0 2
5 01 2 blue 0 3
6 02 1 yellow 0 2
7 02 2 yellow 1 1
8 02 1 red 1 1
9 02 2 red 0 2
10 02 1 blue 0 7
11 02 2 blue 0 7
Upvotes: 4
Reputation: 28709
We can use pivot_longer from pyjanitor to reshape the individual dataframes before merging:
left = (df1.pivot_longer('id',
names_to=('.value', 'num'),
names_pattern=r".+_(.+)_(\d$)")
.rename(columns={"wave":"color",
"num":"wave"})
.assign(like = 1)
)
left
id wave color like
0 01 1 red 1
1 02 1 red 1
2 01 2 red 1
3 02 2 yellow 1
right = (df2.pivot_longer('id',
names_to=(".value", "dislike", "color"),
names_pattern = r".+_(.+)(\d)_(.+)",
sort_by_appearance=True)
.rename(columns = {"dislike":"wave", "wave":"dislike"})
)
right
id wave color dislike
0 01 1 yellow 7
1 01 1 red 1
2 01 1 blue 2
3 01 2 yellow 7
4 01 2 red 1
5 01 2 blue 3
6 02 1 yellow 2
7 02 1 red 1
8 02 1 blue 7
9 02 2 yellow 1
10 02 2 red 2
11 02 2 blue 7
right.merge(left, how = 'outer').fillna(0)
id wave color dislike like
0 01 1 yellow 7 0.0
1 01 1 red 1 1.0
2 01 1 blue 2 0.0
3 01 2 yellow 7 0.0
4 01 2 red 1 1.0
5 01 2 blue 3 0.0
6 02 1 yellow 2 0.0
7 02 1 red 1 1.0
8 02 1 blue 7 0.0
9 02 2 yellow 1 1.0
10 02 2 red 2 0.0
11 02 2 blue 7 0.0
Upvotes: 1