Reputation: 13808
I have two dataframes.
a may have some ['q1', 'q2', 'q3'] key combinations that are not in b, and b may also have some ['q1', 'q2', 'q3'] combinations that are not in a.
By "merge" I mean: where a and b have the same ['q1', 'q2', 'q3'], join those rows, then union in the leftover rows from each side. As far as I can tell, a full outer join
doesn't do this.
My pseudocode is like:
c = a.join(b, on=['q1', 'q2', 'q3'], how='inner')
# add back the rows of a whose key combination never matched;
# a left_anti join is the idiomatic "keys not already in c"
c = c.union(a.join(c.select('q1', 'q2', 'q3'), on=['q1', 'q2', 'q3'], how='left_anti'))
# and likewise for the unmatched rows of b
c = c.union(b.join(c.select('q1', 'q2', 'q3'), on=['q1', 'q2', 'q3'], how='left_anti'))
But this is very inefficient.
Is there a better way?
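To make what I mean by "merge" concrete, here is a plain-Python model of the expected result (the key tuples and payload strings are invented for illustration; dicts keyed by the ('q1', 'q2', 'q3') tuple stand in for the dataframes):

```python
def merge(a, b):
    # a and b map a (q1, q2, q3) key tuple to that side's remaining columns.
    # Matched keys get both payloads (the join); unmatched keys keep the one
    # side that has them (the union of the leftovers), with None elsewhere.
    return {k: (a.get(k), b.get(k)) for k in a.keys() | b.keys()}

a = {(1, 2, 3): 'row_a', (7, 8, 9): 'only_in_a'}
b = {(1, 2, 3): 'row_b', (4, 5, 6): 'only_in_b'}
merged = merge(a, b)
# merged[(1, 2, 3)] -> ('row_a', 'row_b')
# merged[(7, 8, 9)] -> ('only_in_a', None)
# merged[(4, 5, 6)] -> (None, 'only_in_b')
```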
Upvotes: 0
Views: 407
Reputation: 13808
I forgot to resolve this question.
Actually, a full outer join
is the correct way.
But there is a relevant bug in Spark that was marked as fixed in version 2.1, yet I still hit it in 2.4.0:
an unexpected result when an inner join runs before a full outer join:
df1 = spark.read.parquet(...)
df2 = spark.read.parquet(...)
df3 = spark.read.parquet(...)
df4 = spark.read.parquet(...)
df5 = spark.read.parquet(...)
store_product = spark.read.parquet(...)
df1 = df1.join(store_product, on=['store_id', 'product_id'], how='inner')
dfs = [df1, df2, df3, df4, df5]
df = full_outer_join_all(dfs)
Moving the inner join after the full outer join gives the correct result:
df1 = spark.read.parquet(...)
df2 = spark.read.parquet(...)
df3 = spark.read.parquet(...)
df4 = spark.read.parquet(...)
df5 = spark.read.parquet(...)
store_product = spark.read.parquet(...)
dfs = [df1, df2, df3, df4, df5]
df = full_outer_join_all(dfs)
df = df.join(store_product, on=['store_id', 'product_id'], how='inner')
PS: this didn't happen on every dataset; I created some data for testing but could not reproduce it.
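full_outer_join_all isn't shown above; presumably it just folds the list with pairwise full outer joins. A minimal sketch of that idea (the function name comes from the snippet above, but the implementation and the keys parameter are my assumption):

```python
from functools import reduce

def full_outer_join_all(dfs, keys=('q1', 'q2', 'q3')):
    # Fold the list left-to-right, full-outer-joining each frame
    # into the running result on the shared key columns.
    return reduce(
        lambda left, right: left.join(right, on=list(keys), how='full_outer'),
        dfs,
    )
```

Any object with a `join(other, on, how)` method works here, so the same fold applies to the five dataframes in the example.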
Upvotes: 1
Reputation: 7419
I'm assuming you want a left outer
join, which can be achieved by:
df1.join(df2, on=['q1', 'q2', 'q3'], how='left_outer')
Upvotes: 0