Pratyush Das

Reputation: 514

How to write Scala unit tests to compare Spark dataframes?

Purpose - checking whether a dataframe generated by Spark and a manually created dataframe are the same.

Earlier implementation, which worked:

if (da.except(ds).count() != 0 && ds.except(da).count() != 0)

Boolean returned - true

Where da and ds are the generated dataframe and the created dataframe, respectively.

Here I am running the program via the spark-shell.

Newer implementation, which doesn't work:

assert(da.except(ds).count() != 0 && ds.except(da).count() != 0)

Boolean returned - false

Here I am using ScalaTest's assert method instead, but the assertion does not evaluate to true.
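For context, a minimal sketch of how such a comparison might be embedded in a ScalaTest suite run by sbt. Everything here besides the except-based check is an assumption: the session setup (under sbt test there is no ready-made spark session, unlike in the spark-shell), the sample dataframes, and the ScalaTest 3.0.x FunSuite style. Note the assertion below is written in the conventional form, where both except counts are zero when the dataframes hold the same rows, which is the inverse of the condition above.

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class DataFrameCompareSuite extends FunSuite {

  // Unlike the spark-shell, sbt test provides no ready-made session,
  // so the suite creates its own local SparkSession.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("dataframe-compare-test")
    .getOrCreate()

  import spark.implicits._

  test("generated dataframe matches the manually created one") {
    // Hypothetical stand-ins for da (generated) and ds (manually created)
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val ds = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // The two dataframes hold the same rows iff both set differences are empty
    assert(da.except(ds).count() == 0 && ds.except(da).count() == 0)
  }
}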

Why use the new implementation when the previous method worked? So that sbt runs the test file with ScalaTest automatically, via sbt test or at compile time.

The same dataframe-comparison code gives the correct output when run via the spark-shell, but produces an error when run with ScalaTest under sbt.

The two programs are effectively the same, but the results differ. What could be the problem?

Upvotes: 3

Views: 8927

Answers (2)

Pratyush Das

Reputation: 514

I solved the issue by adding https://github.com/MrPowers/spark-fast-tests as a dependency.
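A sketch of how the library is typically wired in, assuming ScalaTest 3.0.x; the dependency coordinates and version are from memory, so verify them against the project's README:

// build.sbt (verify coordinates and version against the README):
// libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "<version>" % Test

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
import com.github.mrpowers.spark.fast.tests.DataFrameComparer

class GeneratedDataFrameSuite extends FunSuite with DataFrameComparer {

  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("spark-fast-tests-example")
    .getOrCreate()

  import spark.implicits._

  test("generated dataframe equals the expected one") {
    // Hypothetical stand-ins for the generated and expected dataframes
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val ds = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Fails with a readable diff if the schemas or contents differ
    assertSmallDataFrameEquality(da, ds)
  }
}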

Another solution would be to iterate over the rows of the two dataframes individually and compare them, as sketched below.
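A rough sketch of that manual approach; the helper name is hypothetical, and it assumes the dataframes are small enough to collect to the driver and share an id column to order by:

import org.apache.spark.sql.{DataFrame, Row}

// Compare two small dataframes row by row after collecting them to the driver.
// Sorting by a key column (assumed to be "id" here) makes the order deterministic.
def assertSameRows(actual: DataFrame, expected: DataFrame): Unit = {
  val actualRows: Array[Row] = actual.orderBy("id").collect()
  val expectedRows: Array[Row] = expected.orderBy("id").collect()

  assert(actualRows.length == expectedRows.length)
  actualRows.zip(expectedRows).foreach { case (a, e) =>
    assert(a == e) // Row.equals compares the rows field by field
  }
}

// Usage inside a test: assertSameRows(da, ds)

Only practical for small test data, since collect() pulls everything onto the driver.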

Upvotes: 3

pasha701

Reputation: 7207

Tests that compare dataframes exist in Spark itself, for example: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

The shared test code (SharedSQLContext, etc.) is published to the central Maven repository; you can include those artifacts in your project and use the "checkAnswer" methods to compare dataframes.
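A sketch of what that can look like on Spark 2.x, where the shared helpers ship in test jars pulled in with the "tests" classifier; the sample dataframe is an assumption, and on Spark 3 SharedSQLContext was replaced by SharedSparkSession:

// build.sbt -- pull in the test jars that ship the shared helpers:
// libraryDependencies ++= Seq(
//   "org.apache.spark" %% "spark-core"     % sparkVersion % Test classifier "tests",
//   "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
//   "org.apache.spark" %% "spark-sql"      % sparkVersion % Test classifier "tests"
// )

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSQLContext

class CheckAnswerSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("generated dataframe matches the expected rows") {
    // Hypothetical stand-in for the generated dataframe
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // checkAnswer compares contents against the expected rows, ignoring row order
    checkAnswer(da, Seq(Row(1, "a"), Row(2, "b")))
  }
}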

Upvotes: 3
