Sara

Reputation: 31

DateType() definition giving Null in PySpark?

I have dates in a big-endian format, like YYYYMMDD, in a CSV.

When I use simple string types, the data loads correctly, but when I use the DateType() object to define the column, I get nulls for everything. Can I define the date format somewhere, or should Spark infer this automatically?

from pyspark.sql.types import StructType, StructField, StringType, DateType

schema_comments = StructType([
    StructField("id", StringType(), True),
    StructField("date", DateType(), True),
])

Upvotes: 2

Views: 3601

Answers (2)

toxicPanda

Reputation: 21

The schema looks good to me.
You can tell Spark how to parse the dates in the CSV using the dateFormat option.

For example:

rc = spark.read.csv('yourCSV.csv', header=False,
                    dateFormat="yyyyMMdd", schema=schema)

Upvotes: 2

Ankit Kumar Namdeo

Reputation: 1464

DateType expects Spark's standard date format (e.g. 1997-02-28), so if you declare it in the schema, the data should already be in that form. If that's not the case, read the column as a string (with pandas or PySpark) and then convert it to a DateType() column in PySpark. Below is sample code to convert the YYYYMMDD format into a date in PySpark:

from pyspark.sql.functions import from_unixtime, unix_timestamp

df2 = df.select(
    'date_str',
    # parse YYYYMMDD -> epoch seconds -> timestamp string -> date
    from_unixtime(unix_timestamp('date_str', 'yyyyMMdd')).cast('date').alias('date'),
)

Upvotes: 1

Related Questions