Reputation: 239
I got this exception while playing with spark.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast
price
from string to int as it may truncate The type path of the target object is: - field (class: "scala.Int", name: "price") - root class: "org.spark.code.executable.Main.Record" You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
How Can this exception be solved? Here is the code
object Main {
case class Record(transactionDate: Timestamp, product: String, price: Int, paymentType: String, name: String, city: String, state: String, country: String,
accountCreated: Timestamp, lastLogin: Timestamp, latitude: String, longitude: String)
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "C:\\winutils\\");
val schema = Encoders.product[Record].schema
val df = SparkConfig.sparkSession.read
.option("header", "true")
.csv("SalesJan2009.csv");
import SparkConfig.sparkSession.implicits._
val ds = df.as[Record]
//ds.groupByKey(body => body.state).count().show()
import org.apache.spark.sql.expressions.scalalang.typed.{
count => typedCount,
sum => typedSum
}
ds.groupByKey(body => body.state)
.agg(typedSum[Record](_.price).name("sum(price)"))
.withColumnRenamed("value", "group")
.alias("Summary by state")
.show()
}
Upvotes: 19
Views: 26789
Reputation: 7711
In my case, the problem was that I was using this:
case class OriginalData(ORDER_ID: Int, USER_ID: Int, ORDER_NUMBER: Int, ORDER_DOW: Int, ORDER_HOUR_OF_DAY: Int, DAYS_SINCE_PRIOR_ORDER: Double, ORDER_DETAIL: String)
However, in the CSV file, I had this for example:
Yes, having "Friday" where only integers representing days of the week should appear, means that I need to clean data. However, to be able to read my CSV file using spark.read.csv("data/jaimemontoya/01.csv")
, I used the following code, where the value of ORDER_DOW
now is String
, not Int
anymore:
case class OriginalData(ORDER_ID: Int, USER_ID: Int, ORDER_NUMBER: Int, ORDER_DOW: String, ORDER_HOUR_OF_DAY: Int, DAYS_SINCE_PRIOR_ORDER: Double, ORDER_DETAIL: String)
Upvotes: 0
Reputation: 23119
You read the csv file first and tried to convert to it to dataset which has different schema. Its better to pass the schema created while reading the csv file as below
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
The default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX
so you also need to pass your custom timestampFormat.
Hope this helps
Upvotes: 49