Reputation: 434
I have two Spark dataframes. The first:

Location  Date      Date_part  Sector    units
USA       7/1/2021  7/1/2021   Cars      200
IND       7/1/2021  7/1/2021   Scooters  180
COL       7/1/2021  7/1/2021   Trucks    100

and the second:

Location  Date  Brands  units  values
UK        null  brand1  400    120
AUS       null  brand2  450    230
CAN       null  brand3  150    34
After doing unionByName, I got:
Location  Date      Date_part  Sector    Brands  units  values
USA       7/1/2021  7/1/2021   Cars      null    200    null
IND       7/1/2021  7/1/2021   Scooters  null    180    null
COL       7/1/2021  7/1/2021   Trucks    null    100    null
UK        null      null       null      brand1  400    120
AUS       null      null       null      brand2  450    230
CAN       null      null       null      brand3  150    34
But my expected dataframe is:
Location  Date      Date_part  Sector    Brands  units  values
USA       7/1/2021  7/1/2021   Cars      null    200    null
IND       7/1/2021  7/1/2021   Scooters  null    180    null
COL       7/1/2021  7/1/2021   Trucks    null    100    null
UK        null      7/1/2021   null      brand1  400    120
AUS       null      7/1/2021   null      brand2  450    230
CAN       null      7/1/2021   null      brand3  150    34
I need the Date_part column to carry the dataframe-1 value for all rows. I tried this code:
df_result = df_final.select(df_1['date_part'], df_final["*"])
This creates an extra date_part column instead of filling the existing one. How do I achieve my expected dataframe?
Upvotes: 1
Views: 844
Reputation: 15318
Assuming Date_part is the same for the whole dataframe, there are several ways to do this - here is one:
from pyspark.sql import functions as F

# take the Date_part value from the first row where it is not null
missing_date = df_result.where(F.col("Date_part").isNotNull()).first()["Date_part"]
# overwrite the whole column with that constant
df_result = df_result.withColumn("Date_part", F.lit(missing_date))
Upvotes: 2