How to remove all records containing null from an RDD?

Question

I loaded an RDD from a CSV file. However, the file includes invalid data, so when I tried to output the content of this RDD with first, I got this exception:

Caused by: java.lang.NumberFormatException: empty String

I am looking for a way to remove every record in the RDD that contains an empty string. The RDD has many fields, so it is difficult to handle each field one by one. I remember that DataFrame has a function for this, na.drop(). I need something similar that works for an RDD.

The code I used is like:

//using case class
case class Flight(dest_id:Long, dest:String, crsdeptime:Double, deptime:Double, depdelaymins:Double, crsarrtime:Double)

//defining function
def parseFlight(str: String): Flight = {
  val line = str.split(",")
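  // convert each field to the type declared in the case class;
  // toLong/toDouble throw NumberFormatException on an empty field
  Flight(line(0).toLong, line(1), line(2).toDouble, line(3).toDouble, line(4).toDouble, line(5).toDouble)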
}

//loading data
val textRDD = sc.textFile("/root/data/data.csv")
val flightsRDD = textRDD.map(parseFlight)

Update

When I used an RDD converted from a DataFrame, I found that every element of the RDD is a Row object. How can I extract some fields of a Row to build an Edge object?

Shaido · Accepted Answer

If the header in the CSV file matches the field names in the case class, then it's easier to read the data as a DataFrame and then use na.drop().

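// needed for the .as[Flight] encoder; assumes `spark` is the active SparkSession
import spark.implicits._

val flightsDf = spark.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // infer numeric types instead of reading strings
  .csv("/root/data/data.csv")
  .na.drop()                      // drop every row that contains a null
  .as[Flight]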

If you want an RDD, it is always possible to convert it afterwards with flightsDf.rdd.
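If you would rather stay with the RDD from the question, one possible approach is to make the parser return an Option and drop the bad lines with flatMap. This is only a sketch: parseFlightOpt is a hypothetical name, and it assumes that every record has at least the six columns of the case class and that a record should be dropped as soon as any field is empty or not parseable.

```scala
import scala.util.Try

case class Flight(dest_id: Long, dest: String, crsdeptime: Double, deptime: Double,
                  depdelaymins: Double, crsarrtime: Double)

// Hypothetical helper: returns None for any line with a missing, empty,
// or non-numeric field, and Some(Flight) otherwise.
def parseFlightOpt(str: String): Option[Flight] = {
  // limit = -1 keeps trailing empty fields, so a line ending in "," is not shortened
  val fields = str.split(",", -1).map(_.trim)
  if (fields.length < 6 || fields.exists(_.isEmpty)) None
  else Try(
    Flight(fields(0).toLong, fields(1), fields(2).toDouble, fields(3).toDouble,
           fields(4).toDouble, fields(5).toDouble)
  ).toOption
}

// With Spark (assumes `sc` is an existing SparkContext, as in the question):
// val flightsRDD = sc.textFile("/root/data/data.csv").flatMap(parseFlightOpt)
```

The emptiness check runs over all fields at once, so nothing has to be handled field by field, and the Try also catches values that are present but not numeric.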
