Reputation: 2251
I'm trying to test a part of my program which performs transformations on dataframes. I want to test several different variations of these dataframes, which rules out the option of reading a specific DF from a file.
And so my questions are:
I obviously googled this before, but could not find anything that was very useful. Among the more useful links I found were:
It would be great if the examples/tutorials were in Scala, but I'll take whatever language you've got.
Thanks in advance
Upvotes: 10
Views: 24906
Reputation: 1751
This link shows how we can programmatically create a DataFrame with a schema. You can keep the data in separate traits and mix it into your tests. For instance,
// This example assumes CSV data, but the same approach should work for other formats as well.
trait TestData {
  val data1 = List(
    "this,is,valid,data",
    "this,is,in-valid,data"
  )
  val data2 = ...
}
Then with ScalaTest, we can do something like this.
class MyDFTest extends FlatSpec with Matchers {
  "method" should "perform this" in new TestData {
    // You can access your test data here. Use it to create the DataFrame.
    // Your test here.
  }
}
To create the DataFrame, you can have a few utility methods like the ones below.
import org.apache.spark.sql.types._

def schema(types: Array[String], cols: Array[String]) = {
  val datatypes = types.map {
    case "String" => StringType
    case "Long"   => LongType
    case "Double" => DoubleType
    // Add more types here based on your data.
    case _        => StringType
  }
  StructType(cols.indices.map(x => StructField(cols(x), datatypes(x))).toArray)
}
import au.com.bytecode.opencsv.CSVParser // assuming opencsv's CSV parser
import org.apache.spark.sql.Row

def df(data: List[String], types: Array[String], cols: Array[String]) = {
  val rdd = sc.parallelize(data)
  // Parse each CSV line; the parser is created inside the closure so that
  // nothing non-serializable is captured.
  val split = rdd.map(line => new CSVParser(',').parseLine(line))
  // Convert each parsed line into a Row matching the schema.
  val rows = split.map(arr => Row(arr(0), arr(1), arr(2), arr(3)))
  sqlContext.createDataFrame(rows, schema(types, cols))
}
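Putting these helpers together in a test might look like this (a sketch: it assumes sc and sqlContext are provided by your test harness, and the column names and types are made up for illustration):

"myTransformation" should "keep all valid rows" in new TestData {
  val types = Array("String", "String", "String", "String")
  val cols = Array("col1", "col2", "col3", "col4")
  val input = df(data1, types, cols)
  // data1 contains two CSV lines, so the DataFrame should have two rows.
  input.count() shouldBe 2
}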
I am not aware of any utility classes for checking specific values in a DataFrame. But I think it should be simple to write one using the DataFrame APIs.
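For instance, a minimal value-checking helper (my own sketch, not an existing utility) could collect a single column and compare it against the expected values:

import org.apache.spark.sql.DataFrame

// Sketch of a checker built on the public DataFrame API.
// Note: it assumes a deterministic row order, e.g. the DataFrame was sorted first.
def assertColumnEquals(df: DataFrame, column: String, expected: Seq[Any]): Unit = {
  val actual = df.select(column).collect().map(_.get(0)).toSeq
  assert(actual == expected, s"Column '$column': expected $expected but got $actual")
}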
Upvotes: 8
Reputation: 2737
You could use SharedSQLContext and SharedSparkSession, which Spark uses for its own unit tests. Check my answer for examples.
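For example, a test built on SharedSparkSession might look roughly like this (a sketch: it assumes the spark-sql test-jar is on your test classpath, since these traits live in Spark's own test sources, and the transformation under test is made up for illustration):

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class MyTransformationSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("upper-cases the name column") {
    val input = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    val result = input.selectExpr("id", "upper(name) AS name")
    // checkAnswer comes from QueryTest and compares rows ignoring order.
    checkAnswer(result, Seq(Row(1, "ALICE"), Row(2, "BOB")))
  }
}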
Upvotes: 1
Reputation: 790
For those looking to achieve something similar in Java, you can start by using this project to initialize a SparkContext within your unit tests: https://github.com/holdenk/spark-testing-base
I personally had to mimic the file structure of some AVRO files. So I used Avro-tools (https://avro.apache.org/docs/1.8.2/gettingstartedjava.html#download_install) to extract the schema from my binary records using the following command:
java -jar $AVRO_HOME/avro-tools-1.8.2.jar tojson largeAvroFile.avro | head -3
Then, using this small helper method, you can convert the output JSON into a DataFrame to use in your unit tests.
private DataFrame getDataFrameFromList() {
    // jsc() is the JavaSparkContext provided by spark-testing-base.
    SQLContext sqlContext = new SQLContext(jsc());
    ImmutableList<String> elements = ImmutableList.of(
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"10.22.63.3\",\"createdDate\":\"2017-05-10T02:09:59.984Z\"}}",
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"11.22.63.3\",\"createdDate\":\"2017-05-11T02:09:59.984Z\"}}",
        "{\"header\":{\"appId\":\"myAppId1\",\"clientIp\":\"12.22.63.3\",\"createdDate\":\"2017-05-11T02:09:59.984Z\"}}"
    );
    JavaRDD<String> parallelize = jsc().parallelize(elements);
    return sqlContext.read().json(parallelize);
}
Upvotes: 0