Reputation: 185
I would like to read a file with the following structure into Apache Spark.
192 242 3 881250949 (the columns are tab separated)
From IMDb I saw the note below regarding the date column: (unix seconds since 1/1/1970 UTC) -- not sure if this has anything to do with my issue.
This is how I defined the schema:
from pyspark.sql.types import *

path_tocsv = "dbfs:/tmp/data.csv"
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("movie_id", IntegerType(), True),
    StructField("rating", IntegerType(), True),
    StructField("date", DateType(), True)])
DataDF = spark.read.csv(path_tocsv, header=False, dateFormat='yyyy-MM-dd',
                        schema=schema, sep='\t')
but I am getting null for all the dates:
+-------+--------+------+----+
|user_id|movie_id|rating|date|
+-------+--------+------+----+
| 196| 242| 3|null|
| 186| 302| 3|null|
| 22| 377| 1|null|
| 244| 51| 2|null|
| 166| 346| 1|null|
| 298| 474| 4|null|
| 115| 265| 2|null|
| 253| 465| 5|null|
| 305| 451| 3|null|
| 6| 86| 3|null|
+-------+--------+------+----+
Any suggestions?
Upvotes: 0
Views: 1514
Reputation: 1739
You need to change the DateType column to LongType, because the raw value is a Unix timestamp (seconds since the epoch), not a date string that Spark can parse with dateFormat. See below -
from pyspark.sql.types import *

path_tocsv = "dbfs:/tmp/data.csv"
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("movie_id", IntegerType(), True),
    StructField("rating", IntegerType(), True),
    StructField("date", LongType(), True)])  # read the timestamp as a plain long
DataDF = spark.read.csv(path_tocsv, header=False,
                        schema=schema, sep='\t')

(The dateFormat option is dropped here since no column is read as a date anymore.)
Once your dataframe is created, you can derive a proper date column as below -

from pyspark.sql.functions import col, from_unixtime, to_date

df = DataDF.withColumn("date_modified", to_date(from_unixtime(col("date"))))  # optionally .drop("date")
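For reference, from_unixtime interprets the stored long as seconds since the Unix epoch (rendered in the Spark session's time zone), and to_date then truncates the result to a date. A minimal plain-Python sketch of the same conversion, outside Spark, using the sample timestamp from the question and pinning the zone to UTC:

```python
from datetime import datetime, timezone

# Sample value from the question's data row (unix seconds since 1/1/1970 UTC).
unix_seconds = 881250949

# Equivalent of from_unixtime(...) followed by to_date(...), pinned to UTC here;
# Spark would instead use the session time zone (spark.sql.session.timeZone).
ts = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
print(ts.date())  # 1997-12-04
```

If the dates come out shifted by a day, check the session time zone, since from_unixtime renders in local time rather than UTC.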
Upvotes: 1