azal

Reputation: 1260

Pyspark join two dataframes

Suppose I have two dataframes at different levels of granularity, like this:

df1
  Month       Day    Value
   Jan      Monday     65      
   Feb      Monday     66
   Mar      Tuesday    68
   Jun      Monday     58 
    

df2
  Month       Day     Hour
   Jan      Monday     5    
   Jan      Monday     5       
   Jan      Monday     8
   Feb      Monday     9
   Feb      Monday     9
   Feb      Monday     9
   Mar      Tuesday    10
   Mar      Tuesday    1
   Jun      Tuesday    2                 
   Jun      Monday     7             
   Jun      Monday     8        

I want to join df1 with df2 and carry the 'Value' column over to df2, so that every hourly row gets the value of its (Month, Day) pair.

Expected output:

   final
      Month       Day     Hour     Value
       Jan      Monday     5         65
       Jan      Monday     5         65
       Jan      Monday     8         65
       Feb      Monday     9         66
       Feb      Monday     9         66
       Feb      Monday     9         66
       Mar      Tuesday    10        68
       Mar      Tuesday    1         68
       Jun      Monday     7         58             
       Jun      Monday     8         58

Upvotes: 0

Views: 117

Answers (1)

mpSchrader

Reputation: 932

This should be a simple join:

df2 = df2.join(df1, on=['Month', 'Day'], how='inner')

The join pairs every row of df2 with every row of df1 that has the same Month and Day. E.g.,

df1:
  Month       Day    Value
   Jan      Monday     65

df2: 
  Month       Day     Hour
   Jan      Monday     5    
   Jan      Monday     5  

Because both df2 rows match the single df1 row on Jan and Monday, both appear in the output, each with Value 65:

      Month       Day     Hour     Value
       Jan      Monday     5         65
       Jan      Monday     5         65

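Note: Which dataframe you join onto which, and whether you use an inner or a left join, depends on how you want to handle mismatches. With df2 on the left, an inner join drops df2 rows that have no match in df1 (such as Jun/Tuesday), while a left join keeps them with a null Value.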

Upvotes: 1
