Reputation: 43
I have the following two PySpark dataframes:
df1:
column1 | column2 | column3 | column4 |
---|---|---|---|
some value | some value | some value | some value1 |
df2:
column1 | column2 | column3 | column5 |
---|---|---|---|
some value | some value | some value | some value2 |
When I try to perform an outer join on them on `column1`, `column2`, and `column3`, I get this:
column1 | column2 | column3 | column4 | column5 |
---|---|---|---|---|
some value | some value | some value | NULL | some value2 |
some value | some value | some value | some value1 | NULL |
But I want the following output:
column1 | column2 | column3 | column4 | column5 |
---|---|---|---|---|
some value | some value | some value | some value1 | some value2 |
How can I do this?
Upvotes: 1
Views: 66
Reputation: 2157
Since you haven't shared your code, I'm not sure what you are doing wrong, because I get your desired output with the following code:
# Outer join on the three shared key columns; column4 and column5 come along from each side
joined_df = df1.join(df2, ["column1", "column2", "column3"], how="outer")
joined_df.show()
Output:
+----------+----------+----------+-----------+-----------+
| column1| column2| column3| column4| column5|
+----------+----------+----------+-----------+-----------+
|some value|some value|some value|some value1|some value2|
+----------+----------+----------+-----------+-----------+
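If you still get the NULL-split rows with exactly this join, the usual cause is that the key values in the two dataframes are not byte-for-byte equal (stray whitespace, different casing), so the outer join cannot match them. Here is a minimal sketch of normalizing the keys before joining, assuming string keys and whitespace/case as the culprit (the trim/lower cleanup is an assumption, not something from your post):
from pyspark.sql import functions as F

# Hypothetical cleanup: trim whitespace and lower-case the join keys on both
# sides so cosmetically different values compare equal. Adjust to whatever
# actually differs in your data.
keys = ["column1", "column2", "column3"]
df1_clean = df1.select(
    [F.lower(F.trim(F.col(c))).alias(c) if c in keys else F.col(c) for c in df1.columns]
)
df2_clean = df2.select(
    [F.lower(F.trim(F.col(c))).alias(c) if c in keys else F.col(c) for c in df2.columns]
)
joined_df = df1_clean.join(df2_clean, keys, how="outer")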
Upvotes: 0
Reputation: 419
What you are looking for is a `left` join, not a `full_outer` join. The following code should produce your expected result.
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Reuse the active session (e.g. in a spark-shell or notebook) or create one.
spark = SparkSession.builder.getOrCreate()

# Recreate the two input dataframes from the question.
data_df1 = [
    Row(
        column1="some value",
        column2="some value",
        column3="some value",
        column4="some value1",
    ),
]
schema_df1 = StructType([
    StructField(name="column1", dataType=StringType(), nullable=True),
    StructField(name="column2", dataType=StringType(), nullable=True),
    StructField(name="column3", dataType=StringType(), nullable=True),
    StructField(name="column4", dataType=StringType(), nullable=True),
])

data_df2 = [
    Row(
        column1="some value",
        column2="some value",
        column3="some value",
        column5="some value2",
    ),
]
schema_df2 = StructType([
    StructField(name="column1", dataType=StringType(), nullable=True),
    StructField(name="column2", dataType=StringType(), nullable=True),
    StructField(name="column3", dataType=StringType(), nullable=True),
    StructField(name="column5", dataType=StringType(), nullable=True),
])

df1 = spark.createDataFrame(data=data_df1, schema=schema_df1)
df2 = spark.createDataFrame(data=data_df2, schema=schema_df2)

# Left join keeps every row of df1 and pulls in matching column5 values from df2.
df_res = df1.join(df2, on=["column1", "column2", "column3"], how="left")
df_res.show()
+----------+----------+----------+-----------+-----------+
| column1| column2| column3| column4| column5|
+----------+----------+----------+-----------+-----------+
|some value|some value|some value|some value1|some value2|
+----------+----------+----------+-----------+-----------+
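Note the trade-off: a `left` join keeps every row of df1 and silently drops rows that exist only in df2, while `full_outer` keeps unmatched rows from both sides (padded with NULLs). For the single matching row above, `left`, `inner`, and `outer` all produce the same output; they only differ once one dataframe contains keys the other does not.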
Upvotes: 0