Reputation: 4842
I am new to pandas and trying to join dataframe values based on keys.
My dataframe looks like this:
Slug  Position  Brand  Session  Transactions  Ecommerce  CTR  Click
A     0         aaa
A     1         bbb
A     2         ccc
A     3         ddd
B     0         bbb
B     1         ccc
B     2         ddd
B     3         eee
C     0         aaa
C     1         ccc
C     2         ddd
A                      70       100           500
A                                                        abc  fgh
B                      60       900           400
B                                                        abd  fgj
C                      50       400           100
C                                                        ab   fp
I want my output to be the following:
Output:
Slug  Position  Brand  Session  Transactions  Ecommerce  CTR  Click
A     0         aaa    70       100           500        abc  fgh
A     1         bbb
A     2         ccc
A     3         ddd
B     0         bbb    60       900           400        abd  fgj
B     1         ccc
B     2         ddd
B     3         eee
C     0         aaa    50       400           100        ab   fp
C     1         ccc
C     2         ddd
Dataset:
import pandas as pd

df = pd.DataFrame({'Slug': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'Position': ['0', '1', '2', '3', '0', '1', '2', '3', '1', '2', '3', '', '', '', '', '', ''],
                   'Brand': ['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'bbb', 'ccc', '', '', '', '', '', ''],
                   'Session': ['', '', '', '', '', '', '', '', '', '', '', '70', '', '60', '', '50', ''],
                   'Transaction': ['', '', '', '', '', '', '', '', '', '', '', '80', '', '50', '', '40', ''],
                   'Ecommerce': ['', '', '', '', '', '', '', '', '', '', '', '700', '', '600', '', '500', ''],
                   'CTR': ['', '', '', '', '', '', '', '', '', '', '', '', 'abc', '', 'abd', '', 'ffp'],
                   'Click': ['', '', '', '', '', '', '', '', '', '', '', '', 'ab', '', 'fgh', '', 'fp']})
In short, I want to join the values onto the first row that matches each key.
I am not sure whether I should use join or merge, because the data is in the same dataframe rather than in two different ones.
I've tried this
df.set_index('Slug').join(df.set_index('Slug'))
but I get this error:
ValueError: columns overlap but no suffix specified: Index(['Position', 'Brand', 'Sessions', 'Transactions', 'Ecommerce CR', 'CTR',
'All clickouts'],
dtype='object')
How would I move on forward trying to get my desired outcome? Thank you for your suggestions.
Upvotes: 3
Views: 158
Reputation: 42916
First we use GroupBy.bfill to pull the values of each group up onto its first row. Then we find the rows with the lowest Position per group, keep the values there, and set everything else to NaN. Finally we remove the rows we don't need by filtering on Position not being NaN:
import numpy as np

df = df.replace('', np.nan)
# Backfill within each Slug group so the first row of a group picks up its values.
df.loc[:, 'Position':] = df.groupby('Slug').bfill()
# df['Position'] = pd.to_numeric(df['Position'])  # use this line if Position is not numeric
# Keep the metric columns only on the row with the lowest Position per group.
df.loc[:, 'Session':] = (
    df.loc[:, 'Session':]
      .where(df['Position'] == df.groupby('Slug')['Position'].transform('min'))
)
# Drop the helper rows (those had no Position) and restore empty strings.
df = df[df['Position'].notna()].replace(np.nan, '')
   Slug  Position Brand Session Transaction Ecommerce  CTR Click
0     A       0.0   aaa      70          80       700  abc    ab
1     A       1.0   bbb
2     A       2.0   ccc
3     A       3.0   ddd
4     B       0.0   aaa      60          50       600  abd   fgh
5     B       1.0   bbb
6     B       2.0   ccc
7     B       3.0   ddd
8     C       1.0   aaa      50          40       500  ffp    fp
9     C       2.0   bbb
10    C       3.0   ccc
Explanation:
With .loc you can select a slice of your dataframe, which lets you modify just that slice. In this case we select df.loc[all rows, all columns from 'Session' onward]:
df.loc[:, 'Session':]
   Session Transaction Ecommerce  CTR Click
0       70          80       700  abc    ab
1       70          80       700  abc    ab
2       70          80       700  abc    ab
3       70          80       700  abc    ab
4       60          50       600  abd   fgh
5       60          50       600  abd   fgh
6       60          50       600  abd   fgh
7       60          50       600  abd   fgh
8       50          40       500  ffp    fp
9       50          40       500  ffp    fp
10      50          40       500  ffp    fp
11      70          80       700  abc    ab
12     NaN         NaN       NaN  abc    ab
13      60          50       600  abd   fgh
14     NaN         NaN       NaN  abd   fgh
15      50          40       500  ffp    fp
16     NaN         NaN       NaN  ffp    fp
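As a side note, a label-based column slice with .loc includes both endpoints, unlike positional slicing. A small standalone sketch with toy data:

```python
import pandas as pd

# Toy frame, just to illustrate how label-based column slicing behaves.
df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})

# .loc slices by label and includes BOTH endpoints.
print(list(df.loc[:, 'b':'d'].columns))  # ['b', 'c', 'd']

# .iloc slices by position and excludes the stop index.
print(list(df.iloc[:, 1:3].columns))  # ['b', 'c']
```

So df.loc[:, 'Session':] with no stop label simply means "from 'Session' through the last column".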
GroupBy.transform('min') gives us back a Series with the same length as our dataframe, holding on each row the lowest value of Position within that row's Slug group:
df.groupby('Slug')['Position'].transform('min')
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 1.0
9 1.0
10 1.0
11 0.0
12 0.0
13 0.0
14 0.0
15 1.0
16 1.0
Name: Position, dtype: float64
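To see the where + transform('min') combination in isolation, here is a minimal sketch with made-up data that keeps each metric only on the row holding its group's minimum Position:

```python
import pandas as pd

# Made-up miniature of the problem: two groups, one metric column.
df = pd.DataFrame({'Slug': ['A', 'A', 'B', 'B'],
                   'Position': [0, 1, 1, 2],
                   'Session': [70, 70, 60, 60]})

# True only on the row with the lowest Position inside each Slug group.
mask = df['Position'] == df.groupby('Slug')['Position'].transform('min')

# where() keeps values where the mask is True and puts NaN elsewhere.
df['Session'] = df['Session'].where(mask)
print(df['Session'].tolist())  # [70.0, nan, 60.0, nan]
```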
Upvotes: 1
Reputation: 1065
Not the most elegant way, but it should work: build separate DataFrames for the three slices, inner-merge them, and then concat the remaining rows back in. UPD: sorry, there was an error in df_dupl; corrected.
import numpy as np

# Replace empty strings with NaN so dropna can see the gaps.
df = df.replace('', np.nan)
df1 = df.dropna(subset=['Position'])[['Slug', 'Position', 'Brand']]
df2 = df.dropna(subset=['Session'])[['Slug', 'Session', 'Transaction', 'Ecommerce']]
df3 = df.dropna(subset=['CTR'])[['Slug', 'CTR', 'Click']]
# Keep the df1 rows that are NOT the first of each Slug group; they get re-appended below.
df_dupl = df1[df1.duplicated('Slug', keep='first')]
# Merge the metrics onto the first row of each group only.
df_res = (df1.drop_duplicates('Slug')
             .merge(df2, on='Slug', how='inner')
             .merge(df3, on='Slug', how='inner'))
df_res = pd.concat([df_res, df_dupl]).sort_values(['Slug', 'Position'])
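A quick way to convince yourself of the duplicated logic behind df_dupl: with keep='first', the first occurrence of each value is marked False and every repeat True, so the mask selects "all but the first row of each group" (toy data below):

```python
import pandas as pd

s = pd.DataFrame({'Slug': ['A', 'A', 'A', 'B', 'B']})

# keep='first' leaves the first occurrence unflagged and flags the repeats.
mask = s.duplicated('Slug', keep='first')
print(mask.tolist())  # [False, True, True, False, True]
```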
Upvotes: 1