Jonas Palačionis

Reputation: 4842

Joining pandas dataframe values on key

I am new to pandas and am trying to join dataframe values based on a key.

My dataframe looks like this:

Slug  Position   Brand    Session   Transactions   Ecommerce   CTR   Click
A     0          aaa
A     1          bbb
A     2          ccc
A     3          ddd
B     0          bbb
B     1          ccc
B     2          ddd
B     3          eee
C     0          aaa
C     1          ccc
C     2          ddd
A                          70        100             500
A                                                               abc    fgh
B                          60        900             400
B                                                               abd    fgj
C                          50        400             100
C                                                               ab     fp


I am trying to make my output to be the following:

Output:

Slug  Position   Brand    Session   Transactions   Ecommerce   CTR   Click
A     0          aaa      70        100            500         abc   fgh
A     1          bbb
A     2          ccc
A     3          ddd
B     0          bbb      60        900             400        abd    fgj
B     1          ccc
B     2          ddd
B     3          eee
C     0          aaa      50        400             100         ab     fp
C     1          ccc
C     2          ddd

Dataset:

df = pd.DataFrame({'Slug': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'Position': ['0', '1', '2', '3', '0', '1', '2', '3', '1', '2', '3', '', '', '', '', '', ''],
                   'Brand': ['aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'bbb', 'ccc', 'ddd', 'aaa', 'bbb', 'ccc', '', '', '', '', '', ''],
                   'Session': ['', '', '', '', '', '', '', '', '', '', '',  '70', '', '60', '', '50', ''],
                   'Transaction': ['', '', '', '', '', '', '', '', '', '', '', '80', '', '50', '', '40', ''],
                   'Ecommerce': ['', '', '', '', '', '', '', '', '', '', '', '700', '', '600', '', '500', ''],
                   'CTR': ['', '', '', '', '', '', '', '', '', '', '', '', 'abc', '', 'abd', '', 'ffp'],
                   'Click': ['', '', '', '', '', '', '', '', '', '', '', '', 'ab', '', 'fgh', '', 'fp']})

So, simply put, I want to join the values onto the first row matching each key.

I am not sure whether I should use join or merge, because the data is in the same dataframe, not in two different ones.

I've tried this:

df.set_index('Slug').join(df.set_index('Slug'))

but get this error:

ValueError: columns overlap but no suffix specified: Index(['Position', 'Brand', 'Sessions', 'Transactions', 'Ecommerce CR', 'CTR',
       'All clickouts'],
      dtype='object')
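As the error says, `DataFrame.join` refuses to self-join when the two frames share column names unless you supply suffixes. A minimal sketch of resolving just the `ValueError` (using a tiny made-up frame, not the full dataset above):

```python
import pandas as pd

df = pd.DataFrame({'Slug': ['A', 'B'], 'Brand': ['aaa', 'bbb']})

# Self-joining on 'Slug' duplicates every remaining column, so pandas
# needs lsuffix/rsuffix to disambiguate the resulting column names.
joined = df.set_index('Slug').join(df.set_index('Slug'),
                                   lsuffix='_left', rsuffix='_right')
print(joined.columns.tolist())  # ['Brand_left', 'Brand_right']
```

Note this only silences the error; it still duplicates every column rather than producing the desired output, which is why the answers below take a different approach.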

How would I move on forward trying to get my desired outcome? Thank you for your suggestions.

Upvotes: 3

Views: 158

Answers (2)

Erfan

Reputation: 42916

First we use GroupBy.bfill to pull the values of each group up to its first row.

Then we keep those values only on the row with the lowest Position and set everything else to NaN.

Finally we remove the rows we don't need by filtering on Position not being NaN:

import numpy as np

df = df.replace('', np.nan).groupby('Slug').apply(lambda x: x.bfill())

# df['Position'] = pd.to_numeric(df['Position']) --> Use this line if Position column is not numeric
df.loc[:, 'Session':] = (
    df.loc[:, 'Session':].where(df['Position'] == df.groupby('Slug')['Position'].transform('min'))
)

df = df[df['Position'].notna()].replace(np.nan, '')
   Slug  Position Brand Session Transaction Ecommerce  CTR Click
0     A       0.0   aaa      70          80       700  abc   fgh
1     A       1.0   bbb                                         
2     A       2.0   ccc                                         
3     A       3.0   ddd                                         
4     B       0.0   aaa      60          50       600  abd   fgj
5     B       1.0   bbb                                         
6     B       2.0   ccc                                         
7     B       3.0   ddd                                         
8     C       1.0   aaa      50          40       500   ab    fp
9     C       2.0   bbb                                         
10    C       3.0   ccc                                         

Explanation:

With .loc you can select a slice of your dataframe in order to alter that slice. In this case we select df.loc[ all rows, all columns from 'Session' onward]:

df.loc[:, 'Session':]

   Session Transaction Ecommerce  CTR Click
0       70          80       700  abc   fgh
1       70          80       700  abc   fgh
2       70          80       700  abc   fgh
3       70          80       700  abc   fgh
4       60          50       600  abd   fgj
5       60          50       600  abd   fgj
6       60          50       600  abd   fgj
7       60          50       600  abd   fgj
8       50          40       500   ab    fp
9       50          40       500   ab    fp
10      50          40       500   ab    fp
11      70          80       700  abc   fgh
12     NaN         NaN       NaN  abc   fgh
13      60          50       600  abd   fgj
14     NaN         NaN       NaN  abd   fgj
15      50          40       500   ab    fp
16     NaN         NaN       NaN   ab    fp

GroupBy.transform('min') gives us an array back with the same length as our dataframe and on each row the lowest value of Position per group of Slug:

df.groupby('Slug')['Position'].transform('min')

0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
5     0.0
6     0.0
7     0.0
8     1.0
9     1.0
10    1.0
11    0.0
12    0.0
13    0.0
14    0.0
15    1.0
16    1.0
Name: Position, dtype: float64
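Comparing Position against this per-group minimum gives a boolean mask, which `.where` then applies: values are kept where the mask is True and replaced with NaN elsewhere. A tiny standalone illustration of that behaviour (hypothetical values):

```python
import pandas as pd

s = pd.Series([70, 70, 60, 60])
mask = pd.Series([True, False, True, False])

# where() keeps a value where the mask is True and fills NaN elsewhere
print(s.where(mask).tolist())  # [70.0, nan, 60.0, nan]
```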

Upvotes: 1

Oleg O

Reputation: 1065

Not the most elegant way, but it should work: make separate DataFrames for the three slices, inner-merge them, then concat with the required empty rows. UPD: sorry, there was an error with df_dupl, corrected.

# Assumes blanks were converted to NaN first, e.g. df = df.replace('', np.nan)
df1 = df.dropna(subset=['Position'])[['Slug', 'Position', 'Brand']]
df2 = df.dropna(subset=['Session'])[['Slug', 'Session', 'Transaction', 'Ecommerce']]
df3 = df.dropna(subset=['CTR'])[['Slug', 'CTR', 'Click']]

# The first row of each Slug group gets the merged values ...
df_first = df1[~df1.duplicated('Slug')]
# ... while the remaining rows of each group stay empty.
df_dupl = df1[df1.duplicated('Slug', keep='first')]

df_res = df_first.merge(df2, on='Slug', how='inner').merge(df3, on='Slug', how='inner')
df_res = pd.concat([df_res, df_dupl]).sort_values(['Slug', 'Position'])

Upvotes: 1
