Reputation: 31
I have dates in a CSV in big-endian order, like YYYYMMDD.
When I use plain string types, the data loads correctly, but when I use the DateType() object to define the column, I get nulls for everything. Can I define the date format somewhere, or should Spark infer this automatically?
from pyspark.sql.types import StructType, StructField, StringType, DateType

schema_comments = StructType([
    StructField("id", StringType(), True),
    StructField("date", DateType(), True),
])
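A minimal read call that reproduces the nulls would be something like this (the file name is a placeholder, not my actual path):

df = spark.read.csv('comments.csv', header=False, schema=schema_comments)
df.show()  # every value in the date column comes back as null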
Upvotes: 2
Views: 3601
Reputation: 21
The schema looks good to me.
You can define how Spark reads the CSV using the dateFormat option.
For example:
rc = spark.read.csv('yourCSV.csv', header=False,
                    dateFormat="yyyyMMdd", schema=schema_comments)
Upvotes: 2
Reputation: 1464
DateType expects Spark's standard date format, yyyy-MM-dd, so if you declare it in the schema the values need to look like 1997-02-28 (a full timestamp such as 1997-02-28 10:30:00 also casts). If that's not the case, read the column as a string with pandas or PySpark, then convert it into a DateType() column. Below is sample code to convert the YYYYMMDD format into DateType in PySpark:
from pyspark.sql.functions import unix_timestamp, from_unixtime

# parse the yyyyMMdd string to a unix timestamp, render it back as a
# standard timestamp string, then cast to DateType
df2 = df.select('date_str',
                from_unixtime(unix_timestamp('date_str', 'yyyyMMdd')).cast('date').alias('date'))
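Since Spark 2.2 you can also pass the format directly to to_date, which skips the unix-timestamp round trip (a sketch, assuming the same df and column name):

from pyspark.sql.functions import to_date

df2 = df.select('date_str', to_date('date_str', 'yyyyMMdd').alias('date'))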
Upvotes: 1