plam

Reputation: 1385

Outer join Spark DataFrames with non-identical join columns, then merge the join columns

Suppose I have the following DataFrames in PySpark:

from pyspark.sql import Row

df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])

Now suppose I want to create df3 by joining/merging df1 and df2.

I tried doing

df1.join(df2, df1.name == df2.name, 'outer')

This doesn't quite work, because it produces two name columns. I then need to somehow combine the two name columns so that a name missing from one column is filled in by the corresponding name from the other.
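For reference, the join above keeps both name columns (a quick check; the exact column order may differ across Spark versions, since older PySpark versions sort Row fields alphabetically):

df1.join(df2, df1.name == df2.name, 'outer').columns
# ['age', 'name', 'name', 'weight']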

How would I do that? Or is there a better way to create df3 from df1 and df2?

Upvotes: 7

Views: 12395

Answers (3)

Michael Hoffman

Reputation: 836

df3 = df1.join(df2, ['name'], 'outer')

Joining on a list of column names in this way prevents the duplication of the name column. See https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
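A quick check of the result (row order in the output is not guaranteed):

df3 = df1.join(df2, ['name'], 'outer')
df3.show()
# +-----+----+------+
# | name| age|weight|
# +-----+----+------+
# | john|  50|   150|
# |james|  25|  null|
# | mike|null|   115|
# +-----+----+------+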

Upvotes: 1

V. Samma

Reputation: 2608

This is a little late, but there is a simpler solution if someone needs it: just a small change to the original poster's attempt. Passing the join column by name (rather than as an equality expression) makes Spark keep a single copy of it:

df1.join(df2, 'name', 'outer')

Upvotes: 5

zero323

Reputation: 330453

You can use the coalesce function, which returns the first non-null argument.

from pyspark.sql.functions import coalesce

# Alias both sides so the two name columns can be referenced unambiguously.
df1 = df1.alias("df1")
df2 = df2.alias("df2")

(df1.join(df2, df1.name == df2.name, 'outer')
  # Take whichever of the two names is non-null.
  .withColumn("name_", coalesce("df1.name", "df2.name"))
  # drop("name") removes both of the original name columns.
  .drop("name")
  .withColumnRenamed("name_", "name"))
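As a side note, here is a minimal, hypothetical demo of what coalesce does on its own, assuming the same sqlContext as in the question (the demo DataFrame is made up for illustration):

# Pick the first non-null value per row across columns a and b.
demo = sqlContext.createDataFrame([(None, 1), (2, None), (3, 4)], ("a", "b"))
demo.select(coalesce("a", "b").alias("first_non_null")).show()
# +--------------+
# |first_non_null|
# +--------------+
# |             1|
# |             2|
# |             3|
# +--------------+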

Upvotes: 12
