Reputation: 18003
I have a file like this:
1,ITEM_001,CAT_01,true,2,50,4,0,false,2019-01-01,2019-01-28,true
1,ITEM_001,CAT_01,true,2,60,4,0,false,2019-01-29,2019-12-32,true
1,ITEM_002,CAT_02,true,2,50,"","",false,2019-01-01,2019-11-22,true
I do not want to infer the schema, in case the file is big. I tried to map to a case class record, but for some reason that did not work.
So, I am doing the following:
val dfPG = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "false")
  .option("nullValue", "")
  .load("/FileStore/tables/SO_QQQ.txt")
and setting the fields explicitly:
val dfPG2 = dfPG.map { r =>
  (r.getString(0).toLong,
   r.getString(1),
   r.getString(2),
   r.getString(3).toBoolean,
   r.getString(4).toInt,
   r.getString(5).toInt,
   r.getString(6) // r.getString(6).toInt
  )
}
I cannot seem to handle a null value and get an Integer type at the same time. Where there is a null value I get a String, but I want an Int, and every approach I have tried ends in an error.
See the commented-out line. The line below fails with a null pointer exception, and for some reason I cannot formulate the check logic here. Is there an easier way?
r.getString(6).toInt
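What I am reaching for is presumably something like the following (a sketch; Option turns a null string into None before the conversion):

Option(r.getString(6)).map(_.toInt) // Option[Int]: None where the field was null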
I must be over-complicating and/or missing something.
Just to add: when loading via a Seq to a DataFrame with Option, it all works fine. The problem is the file input.
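By that I mean something like the following works as expected (a sketch with made-up names):

import spark.implicits._

case class Rec(id: Long, value: Option[Int]) // made-up case class for illustration
Seq(Rec(1L, Some(4)), Rec(2L, None)).toDF().printSchema()
// value: integer (nullable = true)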
Upvotes: 0
Views: 121
Reputation: 26
That's just not the correct way of doing things. Instead of mapping fields by hand (which is both inefficient and extremely error-prone), you should define a schema for your data:
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField("your_integer_field", IntegerType, true),
  ...
))
and provide it to the reader:
val dfPG = spark.read.format("csv")
  .schema(schema)
  ...
  .load("/FileStore/tables/SO_QQQ.txt")
Upvotes: 1