databricks CSV import timestamp NULL issue

Question

I'm trying to follow Databricks Academy Spark SQL course and I'm practising in Databricks community edition. At a point, I need to create a table from a CSV. This is CSV link.

I'm trying to create the table with UI. I'm checking "First row is header" and "Infer Schema" boxes. birthDate field is being shown as String in the preview pane. According to the course material, it should be Timestamp. But if I change field type to Timestamp, the data is loaded with null values. If I leave the birthDate field as it was (String), data is loading properly.

How can I deal with this issue?

Sample data in birthDate field :

1999-09-13T04:00:00.000+0000

Alex Ott · Accepted Answer

inferSchema for CSV file expects data in the specific format: yyyy-MM-dd'T'HH:mm:ss.SSSXXX, and the XXX pattern is matching to following specification of the timezone: Z; -08; -0830; -08:30; -083015; -08:30:15;, that doesn't match your data.

To fix this you may revert to the creating a table via notebook, and add following option to specify the timestamp format: .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"), the whole read command would look as following:

df = spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    .load(path_to_file)

Another possibility is to load the data, and then to use to_timestamp on the necessary column - it has better auto-detection capabilities than CSV infrerSchema:

from pyspark.sql.functions import to_timestamp, col
df = spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(path_to_file)
    .withColumn("birthDate", to_timestamp(col("birthDate")))

databricks CSV import timestamp NULL issue

Answers (1)

Related Questions