Reputation: 877
I am doing join of two data frames having same name and values.
data is:
+----------------+------------------+
|id |name |
+----------------+------------------+
|2 |G2 |
|1 |H2 |
|2 |H2 |
|1 |G2 |
+----------------+------------------+
Code is:
res = data.alias(
"C1"
).join(
data.alias(
"C2"
),
on='id',
how="inner"
).select(
F.col("C1.{0}".format(name)).alias("C1"),
F.col("C2.{0}".format(name)).alias("C2"),
'id'
)
I am getting the result as:
+----+----+----------------+
|C1 |C2 |id |
+----+----+----------------+
|G2 |H2 |2 |
|G2 |G2 |2 |
|H2 |G2 |1 |
|H2 |H2 |1 |
|H2 |H2 |2 |
|H2 |G2 |2 |
|G2 |G2 |1 |
|G2 |H2 |1 |
+----+----+----------------+
Whereas with inner join I expected the result as:
+----------------+------------------+
|id |C1 | C2 |
+----------------+------------------+
|2 |G2 | G2 |
|1 |H2 | H2 |
|2 |H2 | H2 |
|1 |G2 | G2 |
+----------------+------------------+
Why am i getting the result as union with 8 rows and not the above 4 rows.
Upvotes: 0
Views: 2416
Reputation: 42352
Inner join will match all pairs of rows from the two tables which satisfy the given conditions. You asked for rows to be joined whenever their id matches, so the first row will match both the first and the third row, giving two corresponding rows in the resulting dataframe. Similarly, all the other rows will match two other rows with the same id, so at the end you got 8 rows.
If you want your expected result, you do not need a join - simply duplicate the column C1 to C2 using res = data.withColumn("C2", data["C1"]))
.
Upvotes: 2