Reputation: 11833
I'm trying to join two tables in Spark SQL. Each table has 50+ columns, and both have a column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has a duplicated id column.
We can of course specify which id column to keep, like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since we have so many columns in both tables, I do not want to type out all the other column names in the query above. (Other than the id column, there are no duplicated column names.)
What should I do? Thanks.
Upvotes: 2
Views: 232
Reputation: 306
val joined = spark
  .sql("select * from tbl1")
  .join(
    spark.sql("select * from tbl2"),
    Seq("id"),
    "inner" // optional; inner is the default join type
  )
joined should have only one id column. Tested with Spark 2.4.8.
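As a quick sanity check (a minimal sketch, assuming the joined value from above):

joined.printSchema() // "id" should appear exactly once in the schema
assert(joined.columns.count(_ == "id") == 1)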
Upvotes: 0
Reputation: 381
Assuming you want to preserve the "duplicates", you can try using the internal row ID or an equivalent. This helped me in the past when I had to delete exactly one of two identical rows.
select *, ctid from table;
In PostgreSQL, this also outputs the internal row identifier, so rows that were previously identical become distinguishable. I don't know about spark.sql, but I assume you can access a similar attribute there.
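Spark has no ctid, but a rough analogue (a sketch, assuming the tbl1 table from the question) is to tag each row with monotonically_increasing_id():

import org.apache.spark.sql.functions.monotonically_increasing_id

// Attach a unique (but not consecutive) 64-bit ID to every row, so that
// rows that were otherwise identical become distinguishable
val tagged = spark.sql("select * from tbl1")
  .withColumn("row_id", monotonically_increasing_id())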
Upvotes: 0
Reputation: 1271151
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The USING clause matches columns that have the same name in both tables. With select *, the matched column appears only once in the result.
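To illustrate (a minimal sketch with hypothetical two-column tables; assumes import spark.implicits._ is in scope, as in spark-shell):

import spark.implicits._

// Hypothetical tiny tables that share only the "id" column
Seq((1, "a")).toDF("id", "x").createOrReplaceTempView("tbl1")
Seq((1, "b")).toDF("id", "y").createOrReplaceTempView("tbl2")

spark.sql("select * from tbl1 join tbl2 using (id)").columns
// Array(id, x, y) -- "id" appears once

spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id").columns
// Array(id, x, id, y) -- "id" appears twice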
Upvotes: 3