Avenger
Avenger

Reputation: 877

Alias inner join in pyspark

I am doing join of two data frames having same name and values.

data is:

+----------------+------------------+
|id              |name              |
+----------------+------------------+
|2               |G2                |
|1               |H2                |
|2               |H2                |
|1               |G2                |
+----------------+------------------+

Code is:

 res = data.alias(
            "C1"
        ).join(
            data.alias(
                "C2"
            ),
            on='id',
            how="inner"
        ).select(
            F.col("C1.{0}".format(name)).alias("C1"),
            F.col("C2.{0}".format(name)).alias("C2"),
            'id'
        )

I am getting the result as:

+----+----+----------------+
|C1  |C2  |id              |
+----+----+----------------+
|G2  |H2  |2               |
|G2  |G2  |2               |
|H2  |G2  |1               |
|H2  |H2  |1               |
|H2  |H2  |2               |
|H2  |G2  |2               |
|G2  |G2  |1               |
|G2  |H2  |1               |
+----+----+----------------+

Whereas with inner join I expected the result as:

+----------------+------------------+
|id              |C1  |   C2        |
+----------------+------------------+
|2               |G2   | G2         |
|1               |H2   | H2         |
|2               |H2   | H2         |
|1               |G2   | G2         |
+----------------+------------------+

Why am i getting the result as union with 8 rows and not the above 4 rows.

Upvotes: 0

Views: 2416

Answers (1)

mck
mck

Reputation: 42352

Inner join will match all pairs of rows from the two tables which satisfy the given conditions. You asked for rows to be joined whenever their id matches, so the first row will match both the first and the third row, giving two corresponding rows in the resulting dataframe. Similarly, all the other rows will match two other rows with the same id, so at the end you got 8 rows.

If you want your expected result, you do not need a join - simply duplicate the column C1 to C2 using res = data.withColumn("C2", data["C1"])).

Upvotes: 2

Related Questions