Reputation: 712
I read from a CSV where the column time contains a timestamp with milliseconds, e.g. '1414250523582'. When I use TimestampType in the schema it returns NULL. The only way it reads my data is to use StringType.
Now I need this value as a datetime for further processing. First I got rid of the too-long timestamp with this:
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))
A schema check says it is an integer now. Now I try to make it a datetime with
df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))
it returns
TypeError: an integer is required (got type Column)
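For reference, here is a minimal version of what I am running, all in one place (the file path and read options are just placeholders):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col
from datetime import datetime

spark = SparkSession.builder.getOrCreate()

# time has to be read as StringType; TimestampType gives NULL for the millisecond epoch
schema = StructType([StructField("time", StringType(), True)])
df = spark.read.csv("data.csv", header=True, schema=schema)  # placeholder path

# keep the first 10 characters (epoch seconds) and cast to integer
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))

# this is the line that fails
df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))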
When I google, people always just use col("x") to read and transform data, so what am I doing wrong here?
Upvotes: 0
Views: 755
Reputation: 38
The schema check is a bit tricky; the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which does not play nicely with vanilla Python functions like datetime.fromtimestamp. That explains the TypeError: even though the "date" data in the actual rows is an integer, col doesn't let you access it as a plain integer to feed into a Python function. To apply arbitrary Python code to that value you can write a udf easily enough, but in this case pyspark.sql.functions already has a solution for your unix timestamp. Try
df3 = df2.withColumn("date", from_unixtime(col("date")))
using the truncated seconds column "date" rather than the original millisecond string "time", and you should see a nice date in 2014 for your example.
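Putting it together, a minimal sketch (df is the DataFrame from the question with the string column time; the names to_datetime and df3_udf are only for illustration, and the udf variant just shows the alternative mentioned above):
from datetime import datetime
from pyspark.sql.functions import col, from_unixtime, udf
from pyspark.sql.types import IntegerType, TimestampType

# seconds part of the millisecond epoch, as an integer column
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))

# built-in route: from_unixtime turns epoch seconds into a formatted date string
df3 = df2.withColumn("date", from_unixtime(col("date")))

# udf route: wrap the Python function so Spark applies it row by row
to_datetime = udf(lambda seconds: datetime.fromtimestamp(seconds), TimestampType())
df3_udf = df2.withColumn("date", to_datetime(col("date")))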
Small note: this "date" column will be of StringType.
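If you need an actual TimestampType later, the string that from_unixtime produces can be cast; a small sketch continuing from above:
# 'yyyy-MM-dd HH:mm:ss' strings cast cleanly to Spark timestamps
df4 = df3.withColumn("date", col("date").cast("timestamp"))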
Upvotes: 1