Reputation: 69
I have an RDD of this form:
org.apache.spark.rdd.RDD[(String, Int, Array[String])]
This is the first element of the RDD:
(001, 5, Array(a, b, c))
I want to split that array into separate columns, one per element. The expected output would be:
(001, 5, a, b, c)
Any help?
SOLUTION:
I finally solved the problem. I joined the array into a single string with mkString(","), then converted the RDD to a DataFrame. From there I was able to split the string into columns with the withColumn method.
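The steps described in the solution could look roughly like the sketch below. It is not the poster's actual code: the column names (id, n, joined, c1 to c3), the helper name splitJoined, and the local SparkSession setup are assumptions made for illustration, and it assumes every array has at least three elements.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, split}

object SplitArrayColumn {
  // Splits the comma-joined string column "joined" (hypothetical name)
  // into three separate columns and drops the joined column.
  def splitJoined(df: DataFrame): DataFrame = {
    val parts = split(col("joined"), ",")
    df.withColumn("c1", parts.getItem(0))
      .withColumn("c2", parts.getItem(1))
      .withColumn("c3", parts.getItem(2))
      .drop("joined")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-array-column")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(("001", 5, Array("a", "b", "c"))))

    // Join the array into one string, then convert the RDD to a DataFrame.
    val df = rdd
      .map { case (id, n, arr) => (id, n, arr.mkString(",")) }
      .toDF("id", "n", "joined")

    splitJoined(df).show()
    spark.stop()
  }
}
```

Rows whose joined string has fewer than three parts would get null in the missing columns rather than failing.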
Upvotes: 1
Views: 1751
Reputation: 13985
If you have something like this,
RDD[(String, Int, List[String])]
then in general you should not try to generate an RDD with the elements of that List as columns.
The reason is that Scala is a statically typed language, and your RDD[T] needs to be an RDD of a single type T.
Now let's say your RDD only had the following two "rows" (elements), with lists of different lengths:
("001", 5, List("a", "b", "c"))
("002", 5, List("a", "b", "c", "d"))
As you can see, the first row would need an RDD[(String, Int, String, String, String)], but the second would need an RDD[(String, Int, String, String, String, String)].
As a result, the generated RDD can only be typed as a common supertype of both tuple shapes, so you effectively end up with an RDD[Any]. That type restricts what you can do afterwards, because the element types are not known at compile time and generic type information is erased at run-time.
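A small illustration of the typing problem, using plain Scala collections instead of an RDD so it runs without Spark. The object and value names are made up for this sketch. The only static type that describes both tuple shapes here is a general one such as Product, which says nothing about the individual fields.

```scala
object MixedArity {
  val rows: List[(String, Int, List[String])] = List(
    ("001", 5, List("a", "b", "c")),
    ("002", 5, List("a", "b", "c", "d"))
  )

  // Each branch returns a tuple of a different arity, so the element type
  // has to be widened to something both shapes share (here: Product).
  val mixed: List[Product] = rows.map {
    case (s, i, a :: b :: c :: Nil)      => (s, i, a, b, c)
    case (s, i, a :: b :: c :: d :: Nil) => (s, i, a, b, c, d)
    case other                           => other
  }
}
```

With an element type like this, field access such as `_3` is no longer available on the elements without casting or pattern matching again.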
But there is a special case where you can do this without a problem: if you know that every list has the same known length (let's say 3 in this case),
val yourRdd = rdd.map {
  // Takes the first three elements; a list with fewer than three throws a MatchError.
  case (s, i, s1 :: s2 :: s3 :: _) => (s, i, s1, s2, s3)
}
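The `::` pattern above applies to a List. The question's RDD holds an Array[String], for which the Array extractor can be used instead. The following is a minimal sketch with hypothetical names (ArrayPattern, flatten3), written as a plain function so it can be tried without Spark. It returns None for arrays shorter than three instead of throwing.

```scala
object ArrayPattern {
  // Returns the first three array elements as separate tuple fields,
  // or None when the array has fewer than three elements.
  def flatten3(rec: (String, Int, Array[String])): Option[(String, Int, String, String, String)] =
    rec match {
      case (id, n, Array(a, b, c, _*)) => Some((id, n, a, b, c))
      case _                           => None
    }
}
```

On an RDD this could be applied as `rdd.flatMap(r => ArrayPattern.flatten3(r).toList)`, which silently drops the short rows. Whether dropping is acceptable depends on the data.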
If it is not this special case and your lists can have different, unknown sizes, then converting a list of unspecified length to a tuple is not an easy thing to do, even if you really want to. At least, I cannot think of any easy way to do it.
I would advise you to avoid trying that without a very solid reason.
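If a fixed maximum width is acceptable, one possible workaround (not something this answer proposes) is to keep the tuple shape constant and represent missing positions as Option values. The sketch below uses hypothetical names (FixedWidth, toRow) and a width of three chosen only for illustration. Extra elements beyond the third are discarded.

```scala
object FixedWidth {
  type Row = (String, Int, Option[String], Option[String], Option[String])

  // lift turns the list into a total function Int => Option[String],
  // so out-of-range positions become None instead of throwing.
  def toRow(rec: (String, Int, List[String])): Row = {
    val (id, n, xs) = rec
    (id, n, xs.lift(0), xs.lift(1), xs.lift(2))
  }
}
```

Because every row has the same type, `rdd.map(FixedWidth.toRow)` would keep a precise element type instead of widening it.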
Upvotes: 1
Reputation: 1082
I think you just need to take the values from the array one by one and put them into a tuple. Try this:
val result = rdd.map(x => (x._1, x._2, x._3(0), x._3(1), x._3(2)))
Upvotes: 1