Reputation: 4261
I loaded an RDD from a CSV file. However, this file includes invalid data, so when I tried to output the content of this RDD with first, it threw an exception. The exception is:
Caused by: java.lang.NumberFormatException: empty String
I hope to find a solution that removes every record in the RDD that contains an empty string. In addition, this RDD has so many fields that it is difficult to handle every field one by one. I remember that DataFrame has a function for this, na.drop(). I need this kind of function to work for an RDD.
The code I used is like:
// using case class
case class Flight(dest_id: Long, dest: String, crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double)
// defining function
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  // convert each field to the type declared in the case class;
  // an empty field throws NumberFormatException here
  Flight(line(0).toLong, line(1), line(2).toDouble, line(3).toDouble, line(4).toDouble, line(5).toDouble)
}
// loading data
val textRDD = sc.textFile("/root/data/data.csv")
val flightsRDD = textRDD.map(parseFlight)
Update:
When I use an RDD converted from a DataFrame, I find that every element of the RDD is a Row object. How can I extract some fields from a Row to build an Edge object?
Upvotes: 0
Views: 807
Reputation: 28322
If the header in the CSV file matches the field names in the case class, it is easier to read the data as a DataFrame and then use na.drop().
import spark.implicits._  // provides the encoder needed by .as[Flight]

val flightsDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/root/data/data.csv")
  .na.drop()
  .as[Flight]
If you want an RDD, it is always possible to convert it afterwards with flightsDf.rdd.
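If you would rather stay with plain RDDs, as the question asks, one possible approach is to make the parser return an Option and use flatMap, so that lines with an empty or malformed field are dropped instead of throwing. This is only a sketch: parseFlightOpt is a hypothetical helper name, the six-column layout is taken from the Flight case class in the question, and the sample lines are made up for illustration.

```scala
import scala.util.Try

// Same shape as the Flight case class in the question.
case class Flight(dest_id: Long, dest: String, crsdeptime: Double,
                  deptime: Double, depdelaymins: Double, crsarrtime: Double)

// Hypothetical helper: Some(flight) for a clean line, None for a line
// with a missing, empty, or non-numeric field.
def parseFlightOpt(str: String): Option[Flight] = {
  // limit -1 keeps trailing empty fields, so short rows are detected
  val cols = str.split(",", -1).map(_.trim)
  if (cols.length != 6 || cols.exists(_.isEmpty)) None
  else Try(Flight(cols(0).toLong, cols(1), cols(2).toDouble,
                  cols(3).toDouble, cols(4).toDouble, cols(5).toDouble)).toOption
}

// Plain-collection demo with made-up lines; on an RDD the call has the same shape:
//   val flightsRDD = textRDD.flatMap(parseFlightOpt)
val sample = Seq(
  "1,JFK,900.0,905.0,5.0,1200.0",   // valid
  "2,LAX,,1010.0,10.0,1300.0",      // empty field
  "x,SFO,800.0,800.0,0.0,1000.0"    // non-numeric id
)
val flights = sample.flatMap(parseFlightOpt)  // keeps only the JFK record
```

The emptiness check covers every column at once, so no field has to be handled individually. Note that a header line, if the file has one, is also dropped here because its first column does not parse as a Long.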
Upvotes: 1