jgp

Reputation: 2091

Loading plain text dates in Spark v3 from CSV

I am trying to ingest a very basic CSV file with dates in Apache Spark. The difficulty is that the months are spelled out. For analytics purposes, I'd like to have those months as dates. Here is my CSV file:

Period,Total
"January 2000","5394"
"February 2000","5249"
"March 2000","6447"
"April 2000","6062"
"May 2000","6342"
"June 2000","6914"

I try to ingest the data using:

// Creates the schema
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField(
        "month",
        DataTypes.DateType,
        false),
    DataTypes.createStructField(
        "ct",
        DataTypes.IntegerType,
        true) });

// Reads a CSV file with header
Dataset<Row> df = spark.read().format("csv")
    .option("header", true)
    .option("dateFormat", "MMMM YYYY")
    .schema(schema)
    .load("my.csv");

Spark does not really like the dateFormat I use... and gives me:

org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'YYYY' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

I tried LLLL YYYY, M Y, and MMMM Y... LLLL YYYY does not crash, but it returns a null field.

I use Spark v3.1, Java v8. Java is not the problem here ;).

Upvotes: 1

Views: 296

Answers (1)

Vincent Doba

Reputation: 5078

Uppercase Y is the week-based-year pattern, not the year pattern. Use lowercase y for the year.

So your pattern should be MMMM yyyy instead of MMMM YYYY:

Dataset<Row> df = spark.read().format("csv")
    .option("header", true)
    .option("dateFormat", "MMMM yyyy")
    .schema(schema)
    .load("my.csv");
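As a quick standalone sanity check (outside Spark), you can verify the corrected pattern with java.time, which is the same API Spark 3 uses for date parsing. Note that since "MMMM yyyy" carries no day-of-month, a plain parse needs a default day; Spark handles this for you, but a raw DateTimeFormatter does not:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Locale;

public class MonthYearParse {
    public static void main(String[] args) {
        // Build a formatter for "MMMM yyyy" and default the missing
        // day-of-month to 1 so the result resolves to a LocalDate.
        DateTimeFormatter fmt = new DateTimeFormatterBuilder()
            .appendPattern("MMMM yyyy")
            .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
            .toFormatter(Locale.ENGLISH);

        LocalDate d = LocalDate.parse("January 2000", fmt);
        System.out.println(d); // 2000-01-01
    }
}
```

If this parses cleanly, the same pattern string will work in the csv reader's dateFormat option.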

Upvotes: 1
