Ivan Lee

Reputation: 4261

How to remove all records containing null or empty values from an RDD?

I loaded an RDD from a CSV file, but the file contains invalid data. When I try to print the content of this RDD with first, I get this exception:

Caused by: java.lang.NumberFormatException: empty String

I am looking for a way to remove every record in the RDD that contains an empty string. The RDD has many fields, so checking each field one by one is impractical. I remember that DataFrame has a function for this, na.drop(). I need something equivalent that works on an RDD.

The code I used is like:

//using case class
case class Flight(dest_id:Long, dest:String, crsdeptime:Double, deptime:Double, depdelaymins:Double, crsarrtime:Double)

//defining function
def parseFlight(str: String): Flight = {
  val line = str.split(",")
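  // conversions match the Flight field types; toLong/toDouble throw NumberFormatException on an empty string
  Flight(line(0).toLong, line(1), line(2).toDouble, line(3).toDouble, line(4).toDouble, line(5).toDouble)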
}

//loading data
val textRDD = sc.textFile("/root/data/data.csv")
val flightsRDD = textRDD.map(parseFlight)

Update

When I use an RDD converted from a DataFrame, every element of the RDD is a Row object. How can I extract some fields from a Row to build an Edge object?

Upvotes: 0

Views: 807

Answers (1)

Shaido

Reputation: 28322

If the header in the CSV file matches the field names in the case class, it is easier to read the data as a DataFrame and then use na.drop().

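// needed for the .as[Flight] encoder
import spark.implicits._

val flightsDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/root/data/data.csv")
  .na.drop() // drop every row that contains a null in any column
  .as[Flight]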

If you want an RDD, you can always convert it afterwards with flightsDf.rdd.
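If you would rather stay with a plain RDD, one possible approach is to make the parser return an Option and let flatMap discard the bad lines. The following is only a minimal sketch: parseFlightOpt is a hypothetical helper name, the sample lines are made up, and a local Seq stands in for textRDD so the snippet runs without a cluster.

```scala
import scala.util.Try

// Same shape as the Flight case class in the question.
case class Flight(dest_id: Long, dest: String, crsdeptime: Double,
                  deptime: Double, depdelaymins: Double, crsarrtime: Double)

// Hypothetical helper: returns None for any line that has a missing or
// empty field, or a value that does not parse as a number.
def parseFlightOpt(str: String): Option[Flight] = {
  // limit -1 keeps trailing empty fields instead of silently dropping them
  val cols = str.split(",", -1).map(_.trim)
  if (cols.length < 6 || cols.exists(_.isEmpty)) None
  else Try(Flight(cols(0).toLong, cols(1), cols(2).toDouble, cols(3).toDouble,
                  cols(4).toDouble, cols(5).toDouble)).toOption
}

// Made-up sample data standing in for textRDD.
// With Spark the call would be: textRDD.flatMap(line => parseFlightOpt(line))
val lines = Seq("1,ATL,900,905,5,1100", "2,JFK,,910,,1200", "x,LAX,1,2,3,4")
val flights = lines.flatMap(line => parseFlightOpt(line))
```

Only the first sample line survives: the second has empty fields and the third has a non-numeric id. Regarding the update in the question: because the DataFrame version ends with .as[Flight], flightsDf.rdd is an RDD[Flight] rather than an RDD[Row], so fields can be read directly (for example flight.dest_id).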

Upvotes: 1
