Reputation: 1
I'm new to Scala and Spark and I don't know how to do this.
I have preprocessed a CSV file, resulting in an RDD that contains lists with this format:
List("2014-01-01T23:56:06.0", NaN, 1, NaN)
List("2014-01-01T23:56:06.0", NaN, NaN, 2)
All lists have the same number of elements.
What I want to do is combine the lists that share the same first element (the timestamp). For example, I want these two example lists to produce a single List with the following values:
List("2014-01-01T23:56:06.0", NaN, 1, 2)
Thanks for your help :)
Upvotes: 0
Views: 118
Reputation: 7207
If the tail values of each list are doubles, this can be implemented as follows (as sachav suggests):
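// Double.NaN is written out explicitly: a bare NaN is not in scope without an import.
val original = sparkContext.parallelize(
  Seq(
    List("2014-01-01T23:56:06.0", Double.NaN, 1.0, Double.NaN),
    List("2014-01-01T23:56:06.0", Double.NaN, Double.NaN, 2.0)
  )
)
val result = original
  .map(v => v.head -> v.tail) // key each row by its timestamp
  .reduceByKey((acc, curr) =>
    // position by position, keep the left value unless it is NaN
    acc.zip(curr).map { case (left, right) => if (left.asInstanceOf[Double].isNaN) right else left })
  .map(v => v._1 :: v._2) // put the timestamp back at the head of the list
result.collect().foreach(println) // collect first so the rows print on the driver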
Output is:
List(2014-01-01T23:56:06.0, NaN, 1.0, 2.0)
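The merge step itself does not depend on Spark, so it can be tried out on plain Scala collections first. The following is only a sketch under that assumption: `MergeRows`, `mergeValues` and `mergeByTimestamp` are hypothetical names, the rows are assumed to be already split into a `(timestamp, values)` pair with `Double` values, and `groupBy` plus `reduce` stands in for `reduceByKey`.

```scala
object MergeRows {
  // Hypothetical helper: position by position, keep the left value unless it is NaN.
  def mergeValues(left: List[Double], right: List[Double]): List[Double] =
    left.zip(right).map { case (l, r) => if (l.isNaN) r else l }

  // Plain-collection stand-in for map + reduceByKey:
  // group the rows by timestamp, then fold each group with mergeValues.
  def mergeByTimestamp(rows: Seq[(String, List[Double])]): Map[String, List[Double]] =
    rows.groupBy(_._1).map { case (ts, group) =>
      ts -> group.map(_._2).reduce(mergeValues)
    }
}
```

Because `mergeValues` takes typed `List[Double]` arguments, the `asInstanceOf[Double]` cast is not needed; the same function could be passed to `reduceByKey` if the RDD is built as pairs of `(String, List[Double])`.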
Upvotes: 0
Reputation: 196
The DataFrame-based approach below may help you achieve your goal:
import org.apache.spark.sql.functions.min
import spark.implicits._ // enables toDF and the $"col" syntax (already imported in spark-shell)

val input_rdd1 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "NaN", "1", "NaN")))
val input_rdd2 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "NaN", "NaN", "2")))
// added one more row to your data
val input_rdd3 = spark.sparkContext.parallelize(List(("2014-01-01T23:56:06.0", "2", "NaN", "NaN")))
val input_df1 = input_rdd1.toDF("col1", "col2", "col3", "col4")
val input_df2 = input_rdd2.toDF("col1", "col2", "col3", "col4")
val input_df3 = input_rdd3.toDF("col1", "col2", "col3", "col4")
// The columns are strings, and digit strings sort before "NaN", so min picks the non-NaN value.
val output_df = input_df1.union(input_df2).union(input_df3).groupBy($"col1").agg(
  min($"col2").as("col2"),
  min($"col3").as("col3"),
  min($"col4").as("col4")
)
output_df.show
Output:
+--------------------+----+----+----+
|                col1|col2|col3|col4|
+--------------------+----+----+----+
|2014-01-01T23:56:...|   2|   1|   2|
+--------------------+----+----+----+
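One caveat with taking `min` over string columns: the comparison is lexicographic, not numeric. It works for the single-digit sample values because digits sort before the letter `N`, but it can pick the wrong value once a column holds numbers with different digit counts. A small plain-Scala sketch of the difference (the object name `StringMinPitfall` and the sample values are made up for illustration):

```scala
object StringMinPitfall {
  // Made-up sample column, stored as strings like in the DataFrame above.
  val asStrings: List[String] = List("NaN", "9", "10")

  // String comparison goes character by character: '1' < '9' < 'N'.
  val lexicographicMin: String = asStrings.min

  // Parsing to Double and dropping NaN gives the numeric minimum instead.
  val numericMin: Double = asStrings.map(_.toDouble).filterNot(_.isNaN).min
}
```

Here `lexicographicMin` is `"10"` while `numericMin` is `9.0`, so if the real data is numeric it is probably safer to convert the columns to a numeric type before aggregating.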
Upvotes: 1