Reputation: 13
I'm trying to follow Databricks
Academy Spark SQL course and I'm practising in Databricks
community edition. At a point, I need to create a table from a CSV. This is CSV link.
I'm trying to create the table with UI. I'm checking "First row is header" and "Infer Schema" boxes. birthDate
field is being shown as String in the preview pane. According to the course material, it should be Timestamp. But if I change field type to Timestamp, the data is loaded with null values. If I leave the birthDate
field as it was (String
), data is loading properly.
How can I deal with this issue?
Sample data in birthDate
field :
1999-09-13T04:00:00.000+0000
Upvotes: 1
Views: 1378
Reputation: 87279
inferSchema
for CSV file expects data in the specific format: yyyy-MM-dd'T'HH:mm:ss.SSSXXX
, and the XXX
pattern is matching to following specification of the timezone: Z; -08; -0830; -08:30; -083015; -08:30:15;
, that doesn't match your data.
To fix this you may revert to the creating a table via notebook, and add following option to specify the timestamp format: .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
, the whole read command would look as following:
df = spark.read.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
.load(path_to_file)
Another possibility is to load the data, and then to use to_timestamp
on the necessary column - it has better auto-detection capabilities than CSV infrerSchema
:
from pyspark.sql.functions import to_timestamp, col
df = spark.read.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.load(path_to_file)
.withColumn("birthDate", to_timestamp(col("birthDate")))
Upvotes: 1