Mithril
Mithril

Reputation: 13808

spark how to merge two dataframe on several columns?

I have two dataframe

a may have some ['q1', 'q2', 'q3'] not in b , b also may have some ['q1', 'q2', 'q3'] not in a .

Merge means, if a have ['q1', 'q2', 'q3'] same as b, join the rows , then union left rows . full outer join doesn't do such thing.

my presudo code is like :

c = a.join(b, on= ['q1', 'q2', 'q3'], how='inner')
c = c.union(a.filter( ~a.withColumn('xxx', F.concat_ws('|', 'q1', 'q2', 'q3') ).isin(c.select(F.concat_ws('|', 'q1', 'q2', 'q3')) )
c = c.union(b.filter( ~b.withColumn('xxx', F.concat_ws('|', 'q1', 'q2', 'q3') ).isin(c.select(F.concat_ws('|', 'q1', 'q2', 'q3')) )

But this is very unefficient.

Is there any better way ?

Upvotes: 0

Views: 407

Answers (2)

Mithril
Mithril

Reputation: 13808

I forgot to resolve this question .

Actually full outer join is the correct way. But there is a relevant bug in spark which marked as solved in at version 2.1 , but I meet in 2.4.0.

Unexpected result when inner join before full outer join :

df1 = spark.parquet.read(...)
df2 = spark.parquet.read(...)
df3 = spark.parquet.read(...)
df4 = spark.parquet.read(...)
df5 = spark.parquet.read(...)

store_product =  spark.parquet.read(...)

df1 = df1.join(store_product, on=['store_id', 'product_id'], how='inner')

dfs = [df1,df2,df3,df4,df5,]

df = full_outer_join_all(dfs)

Move inner join behind full outer correct :

df1 = spark.parquet.read(...)
df2 = spark.parquet.read(...)
df3 = spark.parquet.read(...)
df4 = spark.parquet.read(...)
df5 = spark.parquet.read(...)

store_product =  spark.parquet.read(...)


dfs = [df1,df2,df3,df4,df5,]

df = full_outer_join_all(dfs)

df = df.join(store_product, on=['store_id', 'product_id'], how='inner')

PS: this didn't happen on all dataset, I created some data for testing but can not reproduce.

Upvotes: 1

pissall
pissall

Reputation: 7419

I'm assuming you want a left-outer join which can be achieved by:

df1.join(df2, on=['q1', 'q2', 'q3'], "left_outer")

enter image description here

Upvotes: 0

Related Questions