Steven

Reputation: 15258

spark read.csv incorrectly parsing timestamps

I have a csv file presence.csv in my datalake :

TIME,RESULT,PARENTLOCALIZATIONCODE,PARENTNAME,SIGFOXID,STATUS
2018-12-10T00:06:14.5464070Z,OK,A,A,19A9BC,OFF_PERIOD

I tried to read it but the TIME column output is wrong :

data = sqlContext.read.csv(
    'presence.csv',
    header='true',
    inferSchema='true',
    sep=","
)

data.show(truncate=False)
+----------------------+------+----------------------+----------+--------+----------+
|TIME                  |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
+----------------------+------+----------------------+----------+--------+----------+
|2018-12-10 01:37:18.07|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
+----------------------+------+----------------------+----------+--------+----------+

I first thought it was a time-zone difference between my server and GMT, but one value is 00:06:14 and the other is 01:37:18, a difference of about 01:31, which is just weird.

Do you have any idea why this conversion happens?
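As a sanity check, the raw value can be parsed outside Spark with Python's own datetime (a quick sketch; `strptime`'s `%f` only accepts up to 6 fractional digits, so the 7-digit fraction is trimmed first, and `%z` matching a literal "Z" needs Python 3.7+):

```python
from datetime import datetime

raw = "2018-12-10T00:06:14.5464070Z"
# %f accepts at most 6 fractional digits, so drop the 7th before parsing
trimmed = raw[:26] + raw[27:]   # "2018-12-10T00:06:14.546407Z"
dt = datetime.strptime(trimmed, "%Y-%m-%dT%H:%M:%S.%f%z")
print(dt)  # 2018-12-10 00:06:14.546407+00:00
```

This confirms the file really does contain 00:06:14 UTC, so any other value in the DataFrame comes from how Spark parses or displays it.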

Upvotes: 2

Views: 2043

Answers (1)

pault
Reputation: 43504

From the docs for pyspark.sql.DataFrameReader.csv, the default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX.

The main problem with your data is that you have 4 extra digits in the fraction of a second, and the hour needs the 24-hour pattern HH (lowercase hh is the 12-hour clock). So for this data you need timestampFormat="yyyy-MM-dd'T'HH:mm:ss.SSSSSSSZZ"

data = spark.read.csv(
    'presence.csv',
    header='true',
    inferSchema='true',
    sep=",",
    timestampFormat="yyyy-MM-dd'T'HH:mm:ss.SSSSSSSZZ"
)

data.show(truncate=False)
#+-----------------------+------+----------------------+----------+--------+----------+
#|TIME                   |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
#+-----------------------+------+----------------------+----------+--------+----------+
#|2018-12-09 19:06:14.546|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
#+-----------------------+------+----------------------+----------+--------+----------+

But as you can see here, the TIME column is being converted to local time (on my system US Eastern, which is GMT-5 in December, hence the 19:06:14 above).
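The shift itself is just a plain UTC-to-local conversion, which can be reproduced with the standard library (a sketch; the fixed -5 offset is an assumption chosen to match the 19:06:14 shown in the output above):

```python
from datetime import datetime, timedelta, timezone

utc_ts = datetime(2018, 12, 10, 0, 6, 14, 546407, tzinfo=timezone.utc)
# Shift into a fixed UTC-5 zone, mimicking what show() does with the
# session's local time zone
local = utc_ts.astimezone(timezone(timedelta(hours=-5)))
print(local.strftime("%Y-%m-%d %H:%M:%S.%f")[:23])  # 2018-12-09 19:06:14.546
```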

If this is not what you want, the "fix" depends on your Spark version and is detailed in the answers to Spark Structured Streaming automatically converts timestamp to local time.
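For example, on Spark 2.2+ one such fix is pinning the SQL session time zone to UTC (a configuration sketch; spark.sql.session.timeZone is only honored on Spark >= 2.2, and on older versions the linked answers instead set the JVM time zone, e.g. via -Duser.timezone=UTC):

```python
from pyspark.sql import SparkSession

# Pin the SQL session time zone so timestamps are displayed in UTC
# rather than the JVM's local zone (Spark >= 2.2 only)
spark = (
    SparkSession.builder
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)
```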

If you applied the version-specific "fix", you'd see the following result:

df.show(truncate=False)
#+-----------------------+------+----------------------+----------+--------+----------+
#|TIME                   |RESULT|PARENTLOCALIZATIONCODE|PARENTNAME|SIGFOXID|STATUS    |
#+-----------------------+------+----------------------+----------+--------+----------+
#|2018-12-10 00:06:14.546|OK    |A                     |A         |19A9BC  |OFF_PERIOD|
#+-----------------------+------+----------------------+----------+--------+----------+


Upvotes: 3
