Milen Kovachev

Reputation: 5381

Where is the union() method on the Spark DataFrame class?

I am using the Java connector for Spark and would like to union two DataFrames, but bizarrely the DataFrame class only has unionAll. Is this intentional, and is there a way to union two DataFrames without duplicates?

Upvotes: 7

Views: 25793

Answers (1)

zero323

Reputation: 330343

Is this intentional?

I think it is safe to assume that it is intentional. Other union operators, like RDD.union and Dataset.union, keep duplicates as well.

If you think about it, this makes sense. An operation equivalent to UNION ALL is a purely logical operation that requires no data access or network traffic. Finding distinct elements, by contrast, requires a shuffle and can therefore be quite expensive.

is there a way to union two DataFrames without duplicates?

df1.unionAll(df2).distinct()
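To make the difference concrete, here is a minimal sketch of the same two-step idea using plain Java streams rather than Spark. It is only an analogy: the class `UnionSketch` and the helpers `unionAll` and `union` are hypothetical names made up for illustration, not part of the Spark API, and in-memory lists stand in for DataFrames.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of Spark.
public class UnionSketch {

    // Analogue of UNION ALL: concatenate both inputs and keep every row,
    // duplicates included. No element is ever compared with another.
    static <T> List<T> unionAll(List<T> left, List<T> right) {
        return Stream.concat(left.stream(), right.stream())
                .collect(Collectors.toList());
    }

    // Analogue of UNION: concatenate first, then drop duplicates.
    // The distinct() step is the one that has to compare elements.
    static <T> List<T> union(List<T> left, List<T> right) {
        return unionAll(left, right).stream()
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);
        List<Integer> right = Arrays.asList(3, 4);

        System.out.println(unionAll(left, right)); // [1, 2, 3, 3, 4]
        System.out.println(union(left, right));    // [1, 2, 3, 4]
    }
}
```

The concatenation step never looks at the values, while the deduplication step has to compare them. In a distributed setting, that comparison is what forces the shuffle described above.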

Upvotes: 19
