osk

Reputation: 810

Obtaining one column of an RDD[Array[String]] and converting it to a Dataset/DataFrame

I have a .csv file that I read into an RDD:

val dataH = sc.textFile(filepath).map(line => line.split(",").map(elem => elem.trim))

I would like to iterate over this RDD in order and compare adjacent elements; the comparison depends on only one column of the data structure. Since it is not possible to iterate over an RDD directly, the idea is to first convert that column of the RDD to either a Dataset or a DataFrame.

You can convert an RDD to a Dataset like this (which doesn't work if my structure is RDD[Array[String]]):

val sc = new SparkContext(conf)  
val sqc = new SQLContext(sc)
import sqc.implicits._
val lines = sqc.createDataset(dataH)

How do I obtain just the one column that I am interested in from dataH, and then create a Dataset from that column alone?

I am using Spark 1.6.0.

Upvotes: 0

Views: 811

Answers (1)

Raphael Roth

Reputation: 27373

You can just map each Array to the element at the desired index, e.g.:

dataH.map(arr => arr(0)).toDF("col1")

Or safer (avoids an exception if the index is out of bounds; rows that lack the column yield null instead):

dataH.map(arr => arr.lift(0).orNull).toDF("col1")
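For completeness, here is a minimal end-to-end sketch of how this might fit together on Spark 1.6. The app name, the local master, the in-memory sample lines (standing in for sc.textFile(filepath)), the colIndex value and the pickColumn helper are all made up for illustration. Unlike the one-liners above, this variant drops rows that are too short instead of failing or producing a placeholder value.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ColumnToDataFrame {
  // Hypothetical helper: bounds-safe access to one field of a split line.
  def pickColumn(fields: Array[String], i: Int): Option[String] = fields.lift(i)

  def main(args: Array[String]): Unit = {
    // App name and local master are assumptions for a standalone test run.
    val conf = new SparkConf().setAppName("column-to-df").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqc = new SQLContext(sc)
    import sqc.implicits._ // needed for toDF and for the String encoder

    // Sample data standing in for sc.textFile(filepath); the last line is too short.
    val raw = sc.parallelize(Seq("a, 1, x", "b, 2, y", "c"))
    val dataH = raw.map(line => line.split(",").map(elem => elem.trim))

    val colIndex = 1 // assumed index of the column of interest

    // RDD[String] holding only the chosen column; short rows are dropped.
    val column = dataH.flatMap(arr => pickColumn(arr, colIndex))

    val df = column.toDF("col1")        // DataFrame with a single string column
    val ds = sqc.createDataset(column)  // Dataset[String] built from the same RDD

    df.show()
    ds.show()
    sc.stop()
  }
}
```

Because the result of the map/flatMap is a plain RDD[String] rather than an RDD[Array[String]], both toDF and createDataset can pick up the implicits from sqc.implicits._.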

Upvotes: 1
