Reputation: 7255
I have a .csv table like the following:
import pandas as pd

df = pd.read_csv('myFile.csv')
df.head(3)
identifier identifier_type timestamp device_lat device_lon
0 68d62a1b-b928-4225-b445-9607415905b3 gaid 2020-03-19 03:03:00 UTC 44.808169 -73.522956
1 1675a629-a010-44b6-98a9-72d04793821f gaid 2020-03-18 21:15:42 UTC 42.103894 -76.799164
2 0fe7a0b7-028e-459e-b5d8-b59d31800b8e gaid 2020-03-18 23:39:54 UTC 43.182028 -77.672017
I am reading it with PySpark:
from pyspark.sql.types import (StructType, StructField, StringType,
                               DateType, FloatType)

schema = StructType([
    StructField("identifier", StringType(), True),
    StructField("identifier_type", StringType(), True),
    StructField("timestamp", DateType(), True),
    StructField("device_lat", FloatType(), True),
    StructField("device_lon", FloatType(), True)])
myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv')
myTable = myTable[myTable['device_lat']>0]
myTable.show(3)
+--------------------+---------------+----------+----------+----------+
| identifier|identifier_type| timestamp|device_lat|device_lon|
+--------------------+---------------+----------+----------+----------+
|68d62a1b-b928-422...| gaid|2020-03-19| 44.80817| -73.52296|
|1675a629-a010-44b...| gaid|2020-03-18| 42.103893|-76.799164|
|0fe7a0b7-028e-459...| gaid|2020-03-18| 43.18203| -77.67202|
+--------------------+---------------+----------+----------+----------+
Why did the hours, minutes, and seconds disappear?
If I try TimestampType instead of DateType:
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, FloatType)

schema = StructType([
    StructField("identifier", StringType(), True),
    StructField("identifier_type", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("device_lat", FloatType(), True),
    StructField("device_lon", FloatType(), True)])
myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv')
myTable = myTable[myTable['device_lat']>0]
sqlContext.registerDataFrameAsTable(myTable, "myTable")
This is what I get:
myTable.show(3)
+----------+---------------+---------+----------+----------+
|identifier|identifier_type|timestamp|device_lat|device_lon|
+----------+---------------+---------+----------+----------+
+----------+---------------+---------+----------+----------+
The dtypes in the original pandas DataFrame are:
df.dtypes
identifier object
identifier_type object
timestamp object
device_lat float64
device_lon float64
dtype: object
Upvotes: 1
Views: 934
Reputation: 38253
Purely guessing, but I think you may need TimestampType instead of DateType.
The documentation for DateType only mentions year/month/day:
A date type, supporting "0001-01-01" through "9999-12-31". Please use the singleton DataTypes.DateType.
Internally, this is represented as the number of days from epoch (1970-01-01 00:00:00 UTC).
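That matches the symptom: a DateType column keeps only the day, so the time of day is discarded at parse time. Here is a minimal sketch demonstrating the truncation (it assumes an active spark session; the column name is just illustrative):

from pyspark.sql import functions as F

# Casting the same string to DateType vs. TimestampType: the date cast
# silently drops the hours, minutes, and seconds.
demo = spark.createDataFrame([("2020-03-19 03:03:00",)], ["ts_str"])
demo.select(
    F.col("ts_str").cast("date").alias("as_date"),         # 2020-03-19
    F.col("ts_str").cast("timestamp").alias("as_ts")       # 2020-03-19 03:03:00
).show()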
According to the PySpark docs, when using spark.read you can specify the timestamp format:
timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSXXX.
The default value looks like the ISO standard, so if your CSV file has a different timestamp format, parsing won't work without explicitly setting the correct format value.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
So, if your timestamp CSV values differ from the default ISO 8601 format (e.g. 2020-03-22T21:51:29Z), you need to match whatever the CSV date/time format is with the corresponding java.text.SimpleDateFormat pattern. The pattern characters are listed in the Java docs:
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
For CSV values such as 2020-01-19 19:30:30 UTC, the format string would be yyyy-MM-dd HH:mm:ss z (note the capital MM for month and HH for the 24-hour clock; lowercase mm means minutes and hh is the 12-hour clock).
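The mismatch likely also explains your empty table: every timestamp fails to parse under the default format, the affected rows come back as nulls, and your device_lat > 0 filter then drops all of them. Putting it together, a sketch along these lines should work (assuming Spark 2.x SimpleDateFormat-style patterns and a header row in the file, neither of which I can verify):

from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, FloatType)

schema = StructType([
    StructField("identifier", StringType(), True),
    StructField("identifier_type", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("device_lat", FloatType(), True),
    StructField("device_lon", FloatType(), True)])

myTable = (spark.read.format("csv")
           .option("header", "true")                           # skip the header line
           .option("timestampFormat", "yyyy-MM-dd HH:mm:ss z") # matches 2020-03-19 03:03:00 UTC
           .schema(schema)
           .load('NY_data/f0.csv'))
myTable = myTable[myTable['device_lat'] > 0]
myTable.show(3)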
Upvotes: 2