Reputation: 1385
Suppose I have the following dataframes in pySpark:
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3
from joining/merging df1
and df2
.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work exactly because it produces two name columns. I need to then somehow combine the two name columns so that missing names from one name column are filled in by the missing name from the other name column.
How would I do that? Or is there a better way to create df3
from df1
and df2
?
Upvotes: 7
Views: 12395
Reputation: 836
df3 = df1.join(df2, ['name'], 'outer')
Joining in this way will prevent the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
Upvotes: 1
Reputation: 2608
This is a little late, but there is a simpler solution if someone needs it. Just a simple change from original poster's solution:
df1.join(df2, 'name', 'outer')
Upvotes: 5
Reputation: 330453
You can use coallesce
function which returns the first not-null argument.
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
.withColumn("name_", coalesce("df1.name", "df2.name"))
.drop("name")
.withColumnRenamed("name_", "name"))
Upvotes: 12