Reputation: 514
Purpose - checking whether a dataframe generated by Spark and a manually created dataframe are the same.
Earlier implementation which worked -
if (da.except(ds).count() != 0 && ds.except(da).count() != 0)
Boolean returned - true
Where da and ds are the generated dataframe and the created dataframe respectively.
Here I am running the program via the spark-shell.
Newer Implementation which doesn't work -
assert(da.except(ds).count() != 0 && ds.except(da).count() != 0)
Boolean returned - false
Here I am using scalatest's assert method instead, but the assertion does not evaluate to true.
Why use the new implementation when the previous method worked? So that sbt runs the test file with scalatest on every sbt test or while compiling.
The same dataframe-comparison code gives the correct output when run in the spark-shell, but fails when run with scalatest under sbt.
The two programs are effectively the same, yet the results differ. What could be the problem?
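For reference, a minimal sketch of how such a check could be wrapped in a ScalaTest suite, assuming a local SparkSession and ScalaTest's AnyFunSuite (the class and column names here are hypothetical). Note that except is a set difference, so if the goal is to assert that the two dataframes are equal, both differences should be empty, i.e. have count zero:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class DataFrameCompareSpec extends AnyFunSuite {

  // A local SparkSession for the test; in a real suite this would
  // typically be shared and stopped after the tests.
  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("df-compare")
    .getOrCreate()

  import spark.implicits._

  test("generated and expected dataframes contain the same rows") {
    // Stand-ins for the generated (da) and manually created (ds) dataframes.
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val ds = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Equality check: both set differences must be empty.
    // Caveat: except() ignores duplicate rows, so this does not
    // distinguish dataframes that differ only in row multiplicity.
    assert(da.except(ds).count() == 0 && ds.except(da).count() == 0)
  }
}
```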
Upvotes: 3
Views: 8927
Reputation: 514
I solved the issue by adding this library as a dependency: https://github.com/MrPowers/spark-fast-tests .
Another solution would be to iterate over the rows of the dataframes individually and compare them.
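With spark-fast-tests, the comparison could look roughly like this sketch, using the library's DatasetComparer trait (the suite name and sample data are made up; check the project's README for the exact artifact coordinates for your Spark version):

```scala
// build.sbt (coordinates vary by release; see the spark-fast-tests README):
// libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "<version>" % "test"

import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class GeneratedDataFrameSpec extends AnyFunSuite with DatasetComparer {

  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("spark-fast-tests-demo")
    .getOrCreate()

  import spark.implicits._

  test("generated dataframe equals the expected one") {
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val ds = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Fails with a readable diff when schemas or rows differ;
    // intended for dataframes small enough to collect to the driver.
    assertSmallDatasetEquality(da, ds)
  }
}
```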
Upvotes: 3
Reputation: 7207
Tests that compare dataframes exist in Spark itself, for example: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
Artifacts with the shared test code (SharedSQLContext, etc.) are published to the central Maven repository; you can include them in your project and use the checkAnswer methods to compare dataframes.
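A rough sketch of what a suite built on Spark's own test helpers might look like, assuming the spark-sql test-jar is on the test classpath and a Spark version where the trait is named SharedSQLContext (later versions renamed it SharedSparkSession, so adjust accordingly):

```scala
// build.sbt (test-jar dependency; exact setup varies by Spark version):
// libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "test" classifier "tests"

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSQLContext

class GeneratedDataFrameSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("generated dataframe matches the expected rows") {
    val da = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // checkAnswer collects the dataframe and compares it against the
    // expected rows, reporting a detailed diff on mismatch.
    checkAnswer(da, Seq(Row(1, "a"), Row(2, "b")))
  }
}
```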
Upvotes: 3