mo1010

Reputation: 185

Spark DateType column returning null

I would like to read in a file with the following structure with Apache Spark.

192 242 3 881250949 (the columns are tab separated)

On IMDb I saw the following note about the date column: (unix seconds since 1/1/1970 UTC). I am not sure whether this has anything to do with my issue.

This is how I defined the schema:

    from pyspark.sql.types import *
    path_tocsv = "dbfs:/tmp/data.csv"

    schema = StructType([
        StructField("user_id", IntegerType(), True),
        StructField("movie_id", IntegerType(), True),
        StructField("rating", IntegerType(), True),
        StructField("date", DateType(), True)])

    DataDF = spark.read.csv(path_tocsv, header=False, dateFormat='yyyy-MM-dd',
                            schema=schema, sep='\t')

But I am getting null for every date:

    +-------+--------+------+----+
    |user_id|movie_id|rating|date|
    +-------+--------+------+----+
    |    196|     242|     3|null|
    |    186|     302|     3|null|
    |     22|     377|     1|null|
    |    244|      51|     2|null|
    |    166|     346|     1|null|
    |    298|     474|     4|null|
    |    115|     265|     2|null|
    |    253|     465|     5|null|
    |    305|     451|     3|null|
    |      6|      86|     3|null|
    +-------+--------+------+----+

Any suggestions?

Upvotes: 0

Views: 1514

Answers (1)

Dipanjan Mallick

Reputation: 1739

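Your file stores the date as Unix epoch seconds (for example 881250949), not as a 'yyyy-MM-dd' string, so Spark cannot parse it as a DateType and returns null. You need to change your DateType column to LongType. See below -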

    from pyspark.sql.types import IntegerType, LongType, StructField, StructType

    path_tocsv = "dbfs:/tmp/data.csv"

    # Read the timestamp as a plain integer (Unix epoch seconds), not as a date.
    schema = StructType([
        StructField("user_id", IntegerType(), True),
        StructField("movie_id", IntegerType(), True),
        StructField("rating", IntegerType(), True),
        StructField("date", LongType(), True),
    ])

    # dateFormat is no longer needed because no column is parsed as a date.
    DataDF = spark.read.csv(path_tocsv, header=False, schema=schema, sep="\t")

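Once your DataFrame is created, you can convert the date column as below. Note that from_unixtime renders the timestamp in the Spark session time zone, so the resulting date follows that zone -

    from pyspark.sql.functions import col, from_unixtime, to_date

    # Epoch seconds -> timestamp string -> date; add .drop("date") to discard the raw column.
    df = DataDF.withColumn("date_modified", to_date(from_unixtime(col("date"))))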
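
If you want to sanity-check the result without a cluster, here is a minimal sketch in plain Python (standard library only, no Spark). The helper names epoch_to_utc_date and parses_as_iso_date are made up for this illustration, and strptime's %Y-%m-%d is only a rough stand-in for Spark's 'yyyy-MM-dd' pattern, so treat it as an analogy rather than what Spark does internally. It uses the timestamp from the sample row in the question.

```python
from datetime import date, datetime, timezone


def epoch_to_utc_date(seconds: int) -> date:
    """Convert Unix epoch seconds to a calendar date in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


def parses_as_iso_date(text: str) -> bool:
    """Return True if text matches a year-month-day pattern (%Y-%m-%d)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


sample = "881250949"  # timestamp from the sample row in the question

# The raw value is not a year-month-day string, so date parsing fails.
print(parses_as_iso_date(sample))      # False

# Interpreted as epoch seconds, it is a valid point in time.
print(epoch_to_utc_date(int(sample)))  # 1997-12-04
```

The date printed here is in UTC. The Spark version above may give a different day for timestamps close to midnight if your session time zone is not UTC.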

Upvotes: 1
