o-90

Reputation: 17585

Array to Tuple in Spark with many input variables

Let's say I am importing a flat file from HDFS into Spark using something like the following:

val data = sc.textFile("hdfs://name_of_file.tsv").map(_.split('\t'))

This will produce an RDD[Array[String]]. If I wanted tuples instead, I could map each array to a tuple as referenced in this solution.

val dataToTuple = data.map{ case Array(x,y) => (x,y) }

But what if my input data has, say, 100 columns? Is there a way in Scala, using some sort of wildcard, to write

val dataToTuple = data.map{ case Array(x,y, ... ) => (x,y, ...) }

without having to write out 100 variables to match on?

I tried doing something like

val dataToTuple = data.map{ case Array(_) => (_) }

but that didn't seem to make much sense.

Upvotes: 0

Views: 2370

Answers (1)

dk14

Reputation: 22374

If your data columns are homogeneous (e.g. an Array of Strings), a tuple may not be the best way to improve type safety. All you can do is fix the size of your array using a sized list from the Shapeless library:

How to require typesafe constant-size array in scala?

This is the right approach if your columns are unnamed. For instance, your row might be the representation of a vector in Euclidean space.
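As a rough sketch of what that looks like on a single row (assuming the shapeless 2.x syntax.sized conversion; the sample values and the column count of 3 are made up for illustration):

import shapeless._
import shapeless.syntax.sized._

val raw: List[String] = "1.0\t2.0\t3.0".split('\t').toList

// sized(3) returns Some(Sized(...)) only if the list really has 3 elements,
// and None otherwise, so the row length becomes part of the type.
val fixed: Option[Sized[List[String], nat._3]] = raw.sized(3)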

Otherwise (named columns, possibly of different types), it's better to model the row with a case class, but be aware of the 22-field size restriction. This might help you quickly map an array (or part of it) to an ADT: https://stackoverflow.com/a/19901310/1809978
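A minimal sketch of the case class approach, reusing the data RDD from your question (the Record class and its three fields are invented for illustration, they are not from your data):

case class Record(id: String, name: String, score: Double)

// Rows whose arrays don't have exactly 3 columns will throw a MatchError here;
// use flatMap with a guarded pattern if your input can be ragged.
val records = data.map {
  case Array(id, name, score) => Record(id, name, score.toDouble)
}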

Upvotes: 1
