PRuss

Reputation: 43

Joining two PySpark dataframes without duplicating rows

I have the following two PySpark dataframes:

df1:

+----------+----------+----------+-----------+
|   column1|   column2|   column3|    column4|
+----------+----------+----------+-----------+
|some value|some value|some value|some value1|
+----------+----------+----------+-----------+

df2:

+----------+----------+----------+-----------+
|   column1|   column2|   column3|    column5|
+----------+----------+----------+-----------+
|some value|some value|some value|some value2|
+----------+----------+----------+-----------+

When I try to perform an outer join on them on column1, column2, and column3, I get this:

+----------+----------+----------+-----------+-----------+
|   column1|   column2|   column3|    column4|    column5|
+----------+----------+----------+-----------+-----------+
|some value|some value|some value|       NULL|some value2|
|some value|some value|some value|some value1|       NULL|
+----------+----------+----------+-----------+-----------+

But I want the following output:

+----------+----------+----------+-----------+-----------+
|   column1|   column2|   column3|    column4|    column5|
+----------+----------+----------+-----------+-----------+
|some value|some value|some value|some value1|some value2|
+----------+----------+----------+-----------+-----------+

How can I do this?

Upvotes: 1

Views: 66

Answers (2)

Vikas Sharma

Reputation: 2157

Since you haven't shared your code, it's hard to say what you are doing wrong, but I get your desired output with the following code:

joined_df = df1.join(df2, ["column1", "column2", "column3"], how="outer")
joined_df.show()

Output:

+----------+----------+----------+-----------+-----------+
|   column1|   column2|   column3|    column4|    column5|
+----------+----------+----------+-----------+-----------+
|some value|some value|some value|some value1|some value2|
+----------+----------+----------+-----------+-----------+

Upvotes: 0

Samuel Demir

Reputation: 419

What you are looking for is a left join, not a full_outer join.

The following code should produce your expected result:

from pyspark.sql import Row
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

data_df1 = [
    Row(
        column1="some value",
        column2="some value",
        column3="some value",
        column4="some value1"
    ),
]

schema_df1 = StructType([
    StructField(name="column1", dataType=StringType(), nullable=True),
    StructField(name="column2", dataType=StringType(), nullable=True),
    StructField(name="column3", dataType=StringType(), nullable=True),
    StructField(name="column4", dataType=StringType(), nullable=True),
])

data_df2 = [
    Row(
        column1="some value",
        column2="some value",
        column3="some value",
        column5="some value2"
    ),
]

schema_df2 = StructType([
    StructField(name="column1", dataType=StringType(), nullable=True),
    StructField(name="column2", dataType=StringType(), nullable=True),
    StructField(name="column3", dataType=StringType(), nullable=True),
    StructField(name="column5", dataType=StringType(), nullable=True),
])

df1 = spark.createDataFrame(data=data_df1, schema=schema_df1)
df2 = spark.createDataFrame(data=data_df2, schema=schema_df2)

df_res = df1.join(df2, on=["column1", "column2", "column3"], how="left")

df_res.show()

+----------+----------+----------+-----------+-----------+
|   column1|   column2|   column3|    column4|    column5|
+----------+----------+----------+-----------+-----------+
|some value|some value|some value|some value1|some value2|
+----------+----------+----------+-----------+-----------+

Upvotes: 0
