Andre

Reputation: 712

PySpark Timestamp as String to DateTime

I read from a CSV where the column time contains a timestamp with milliseconds, '1414250523582'. When I use TimestampType in the schema it returns NULL. The only way it reads my data is to use StringType.

Now I need this value to be a datetime for further processing. First I got rid of the too-long timestamp with this:

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))

A schema check says it is an integer now. Now I try to make it a datetime with

from datetime import datetime

df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))

it returns

TypeError: an integer is required (got type Column)

When I google, people always just use col("x") to read and transform data, so what am I doing wrong here?

Upvotes: 0

Views: 755

Answers (1)

ryofthestorm

Reputation: 38

The schema checks are a bit tricky; the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which often does not play nicely with vanilla Python functions like datetime.fromtimestamp. This explains the TypeError: even though the "date" data in the actual rows is an integer, col doesn't let you access it as an integer to feed into a Python function quite so simply.

To apply arbitrary Python code to that integer value, you can compile a udf pretty easily, but in this case pyspark.sql.functions already has a solution for your unix timestamp. Try this:

df3 = df2.withColumn("date", from_unixtime(col("time")))

and you should see a nice date in 2014 for your example.
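For completeness, here is a minimal sketch of the udf route mentioned above, assuming the df2 from the question (to_datetime is a hypothetical name, not a built-in):

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType

# Wrap datetime.fromtimestamp in a udf so Spark applies it to each row's
# integer value rather than to the Column object itself; guard against NULLs.
to_datetime = udf(lambda s: datetime.fromtimestamp(s) if s is not None else None, TimestampType())

df3 = df2.withColumn("date", to_datetime(col("time")))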

Small note: the "date" column produced by from_unixtime will be of StringType.
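If you need a real timestamp type for further processing, one option (a sketch, again assuming the df2 from the question) is to cast the from_unixtime result:

from pyspark.sql.functions import col, from_unixtime

# from_unixtime produces a StringType column; cast it to get TimestampType.
df3 = df2.withColumn("date", from_unixtime(col("time")).cast("timestamp"))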

Upvotes: 1
