Reputation: 2091
I am trying to ingest a very basic CSV file with dates in Apache Spark. The complexity resides in the months being spelled out. For analytics purposes, I'd like to have those months as a date. Here is my CSV file:
Period,Total
"January 2000","5394"
"February 2000","5249"
"March 2000","6447"
"April 2000","6062"
"May 2000","6342"
"June 2000","6914"
I try to ingest the data using:
// Creates the schema
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("month", DataTypes.DateType, false),
    DataTypes.createStructField("ct", DataTypes.IntegerType, true)
});
// Reads a CSV file with header
Dataset<Row> df = spark.read().format("csv")
.option("header", true)
.option("dateFormat", "MMMM YYYY")
.schema(schema)
.load("my.csv");
Spark does not really like the dateFormat I use, and gives me:
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'YYYY' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
I tried LLLL YYYY, M Y, and MMMM Y... LLLL YYYY does not crash but returns a null field.
I use Spark v3.1, Java v8. Java is not the problem here ;).
Upvotes: 1
Views: 296
Reputation: 5078
Uppercase Y is the week-based-year pattern, not the year pattern. You should use lowercase y letters for the year. So your pattern should be MMMM yyyy instead of MMMM YYYY:
Dataset<Row> df = spark.read().format("csv")
.option("header", true)
.option("dateFormat", "MMMM yyyy")
.schema(schema)
.load("my.csv");
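The difference is visible in plain java.time, which Spark 3's datetime parser is built on. With yyyy, the parsed text resolves to an actual year; with YYYY, only a week-based-year field is produced, so no date can be assembled from "January 2000". A minimal sketch (class name is just for illustration):

```java
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;
import java.time.temporal.TemporalAccessor;
import java.util.Locale;

public class YearPatternDemo {
    public static void main(String[] args) {
        // Lowercase y: year-of-era, resolves to a proper YEAR field
        DateTimeFormatter year = DateTimeFormatter.ofPattern("MMMM yyyy", Locale.ENGLISH);
        TemporalAccessor withYear = year.parse("January 2000");
        System.out.println(withYear.getLong(ChronoField.YEAR)); // 2000

        // Uppercase Y: week-based year; parsing succeeds, but no plain
        // YEAR field is available, so a date cannot be built from it
        DateTimeFormatter weekYear = DateTimeFormatter.ofPattern("MMMM YYYY", Locale.ENGLISH);
        TemporalAccessor withWeekYear = weekYear.parse("January 2000");
        System.out.println(withWeekYear.isSupported(ChronoField.YEAR)); // false
    }
}
```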
Upvotes: 1