Reputation: 7255
I have a .csv table like the following:
import pandas as pd

df = pd.read_csv('myFile.csv')
df.head(3)
identifier identifier_type timestamp device_lat device_lon
0 68d62a1b-b928-4225-b445-9607415905b3 gaid 2020-03-19 03:03:00 UTC 44.808169 -73.522956
1 1675a629-a010-44b6-98a9-72d04793821f gaid 2020-03-18 21:15:42 UTC 42.103894 -76.799164
2 0fe7a0b7-028e-459e-b5d8-b59d31800b8e gaid 2020-03-18 23:39:54 UTC 43.182028 -77.672017
I am reading it with PySpark:
from pyspark.sql.types import (StructType, StructField, StringType,
                               DateType, FloatType)

schema = StructType([
    StructField("identifier", StringType(), True),
    StructField("identifier_type", StringType(), True),
    StructField("timestamp", DateType(), True),
    StructField("device_lat", FloatType(), True),
    StructField("device_lon", FloatType(), True)])
myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv')
myTable = myTable[myTable['device_lat']>0]
myTable.show(3)
+--------------------+---------------+----------+----------+----------+
| identifier|identifier_type| timestamp|device_lat|device_lon|
+--------------------+---------------+----------+----------+----------+
|68d62a1b-b928-422...| gaid|2020-03-19| 44.80817| -73.52296|
|1675a629-a010-44b...| gaid|2020-03-18| 42.103893|-76.799164|
|0fe7a0b7-028e-459...| gaid|2020-03-18| 43.18203| -77.67202|
+--------------------+---------------+----------+----------+----------+
Why did the hours, minutes, and seconds disappear?
If I try TimestampType instead of DateType:
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, FloatType)

schema = StructType([
    StructField("identifier", StringType(), True),
    StructField("identifier_type", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("device_lat", FloatType(), True),
    StructField("device_lon", FloatType(), True)])
myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv')
myTable = myTable[myTable['device_lat']>0]
sqlContext.registerDataFrameAsTable(myTable, "myTable")
This is what I get:
myTable.show(3)
+----------+---------------+---------+----------+----------+
|identifier|identifier_type|timestamp|device_lat|device_lon|
+----------+---------------+---------+----------+----------+
+----------+---------------+---------+----------+----------+
The dtypes in the original pandas DataFrame are:
df.dtypes
identifier object
identifier_type object
timestamp object
device_lat float64
device_lon float64
dtype: object
Upvotes: 1
Views: 934
Reputation: 38253
Purely guessing, but I think you may need TimestampType instead of DateType.
The documentation for DateType only mentions year/month/day:
A date type, supporting "0001-01-01" through "9999-12-31". Please use the singleton DataTypes.DateType.
Internally, this is represented as the number of days from epoch (1970-01-01 00:00:00 UTC).
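That matches the symptom: a DateType column keeps only the day, so the time of day is discarded at parse time. Here is a minimal sketch demonstrating the truncation (it assumes an active spark session; the column name is just illustrative):

from pyspark.sql import functions as F

# Casting the same string to DateType vs. TimestampType: the date cast
# silently drops the hours, minutes, and seconds.
demo = spark.createDataFrame([("2020-03-19 03:03:00",)], ["ts_str"])
demo.select(
    F.col("ts_str").cast("date").alias("as_date"),         # 2020-03-19
    F.col("ts_str").cast("timestamp").alias("as_ts")       # 2020-03-19 03:03:00
).show()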
According to the PySpark docs, when using spark.read you can specify the timestamp format:
timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSXXX.
The default value looks like the ISO standard, so if your CSV file has a different timestamp format, parsing won't work without explicitly setting the correct format value.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
So, if your timestamp CSV values differ from the default ISO 8601 format (e.g. 2020-03-22T21:51:29Z), you need to match whatever the CSV date/time format is with the corresponding java.text.SimpleDateFormat pattern. The pattern characters are listed in the Java docs:
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
For CSV values such as 2020-01-19 19:30:30 UTC, the format string would be yyyy-MM-dd HH:mm:ss z (note the capital MM for month and HH for the 24-hour clock; lowercase mm means minutes and hh is the 12-hour clock).
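The mismatch likely also explains your empty table: every timestamp fails to parse under the default format, the affected rows come back as nulls, and your device_lat > 0 filter then drops all of them. Putting it together, a sketch along these lines should work (assuming Spark 2.x SimpleDateFormat-style patterns and a header row in the file, neither of which I can verify):

from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, FloatType)

schema = StructType([
    StructField("identifier", StringType(), True),
    StructField("identifier_type", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("device_lat", FloatType(), True),
    StructField("device_lon", FloatType(), True)])

myTable = (spark.read.format("csv")
           .option("header", "true")                           # skip the header line
           .option("timestampFormat", "yyyy-MM-dd HH:mm:ss z") # matches 2020-03-19 03:03:00 UTC
           .schema(schema)
           .load('NY_data/f0.csv'))
myTable = myTable[myTable['device_lat'] > 0]
myTable.show(3)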
Upvotes: 2