Obtaining one column of an RDD[Array[String]] and converting it to a Dataset/DataFrame

Question

I have a .csv file that I read into an RDD:

val dataH = sc.textFile(filepath).map(line => line.split(",").map(elem => elem.trim))

I would like to iterate over this RDD in order and compare adjacent elements; the comparison depends on only one column of the data structure. Since it is not possible to iterate over RDDs, the idea is to first convert that column of the RDD to either a Dataset or a DataFrame.

You can convert an RDD to a Dataset like this (which doesn't work if my structure is RDD[Array[String]]):

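val sc = new SparkContext(conf)
val sqc = new SQLContext(sc)
// brings the implicit conversions and encoders into scope
import sqc.implicits._
// requires an implicit Encoder for the RDD's element type
val lines = sqc.createDataset(dataH)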

How do I obtain just the one column that I am interested in from dataH and then create a Dataset from it alone?

I am using Spark 1.6.0.

Raphael Roth · Accepted Answer

You can map each Array to the element at the desired index, e.g.:

dataH.map(arr => arr(0)).toDF("col1")

Or, more safely (avoids an exception if the index is out of bounds):

dataH.map(arr => arr.lift(0).orElse(None)).toDF("col1")
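To make the difference between the two variants concrete, here is a minimal plain-Scala sketch (no Spark involved) of what lift does on an Array. The object name SafeColumnDemo, the helper column, and the sample rows are hypothetical and stand in for the split and trimmed CSV lines:

```scala
object SafeColumnDemo {
  // Hypothetical sample rows; the last one is empty to show the
  // out-of-bounds case.
  val rows: Seq[Array[String]] =
    Seq(Array("a", "1"), Array("b", "2"), Array.empty[String])

  // lift(i) returns Some(value) for a valid index and None otherwise,
  // whereas rows.map(_(i)) would throw on the empty row.
  def column(i: Int): Seq[Option[String]] = rows.map(_.lift(i))

  def main(args: Array[String]): Unit =
    println(column(0)) // prints List(Some(a), Some(b), None)
}
```

Since lift already returns an Option, the trailing .orElse(None) in the snippet above does not change the result. If a plain String column is preferred, arr.lift(0).orNull should give null for missing values instead of None, though I have not checked how every Spark version handles that.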
