Reputation: 1655
I am using spark-excel package for processing ms excel files using spark 2.2. Some of the files are getting failed to load as a spark dataframe with below exception. If someone have faced this issue can you please help to fix such data type issues?
After analyzing I found at that if column name is not a string, it ends up giving below exception, If I manually change the column name to string from integer, it works fine.
Code:
val excelDF = spark.read.
format("com.crealytics.spark.excel").
option("useHeader", "true").
option("treatEmptyValuesAsNulls", "true").
option("inferSchema", "true").
option("addColorColumns", "False").
option("sheetName", sheetName).
load(filePath)
Exception:
java.lang.IllegalStateException: Cannot get a STRING value from a NUMERIC cell
at org.apache.poi.xssf.usermodel.XSSFCell.typeMismatch(XSSFCell.java:1077)
at org.apache.poi.xssf.usermodel.XSSFCell.getRichStringCellValue(XSSFCell.java:395)
at org.apache.poi.xssf.usermodel.XSSFCell.getStringCellValue(XSSFCell.java:347)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1$$anonfun$10.apply(ExcelRelation.scala:206)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1$$anonfun$10.apply(ExcelRelation.scala:205)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:205)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:204)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:204)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:91)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:37)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
Upvotes: 2
Views: 1083
Reputation: 21
The new version of com.crealytics:spark-excel_2.11:0.12.5
library works with non string column/header names as well.
Upvotes: 1
Reputation: 99
There is probably a more elegant answer for this I would post this as comment, but do not have the required reputation.
I always try to ensure that my column headers are strings.
Also as a rule I do not have numeric characters in the column headers and we have a simple script that replaces the numeric with alphabetical characters (i.e. 1 by one).
Upvotes: 0