zesla
zesla

Reputation: 11833

How to avoid key column name duplication in join?

I'm trying to join two tables in spark sql. Each table has 50+ columns. Both has column id as the key.

spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")

The joined table has duplicated id column. We can of course specify which id column to keep like below:

spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")

But since we have so many columns in both tables, I do not want to type all the other column names in the query above. (other than id column, no other duplicated column names).

what should I do? thanks.

Upvotes: 2

Views: 232

Answers (3)

meniluca
meniluca

Reputation: 306

val joined = spark
     .sql("select * from tbl1")
     .join(
          spark.sql("select * from tbl2"),
          Seq("id"),
          "inner" // optional
     )

joined should have only one id column. Tested with Spark 2.4.8

Upvotes: 0

MaestroGlanz
MaestroGlanz

Reputation: 381

Assuming, you want to preserve the "duplicates", you can try to use the internal row-id or equivalents for your help. This helped me in the past, if I had to delete exactly one of two identical rows.

select *,ctid from table;

outputs in postgresql also the internal counter id. Your before exact identical rows become different now. I don't know about spark.sql, but I assume, that you can access a similar attribute there.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1271151

If id is the only column name in common, you can take advantage of the USING clause:

spark.sql("select * from tbl1 join tbl2 using (id) ")

The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.

Upvotes: 3

Related Questions