Reputation: 1522
Something I enjoy about working with DataFrames is the ability to chain function calls together. The problem I run into is that I'm struggling to find syntax that lets you perform a cast or a withColumn operation that references a column of the DataFrame. For example:
counts = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header=True) \
    .load(path) \
    .filter("cast(filterColumn as int) in (8, 11, 12)") \
    .withColumn('newColumn', df.oldColumn.cast("date")) \ #<-- df doesn't exist, silly!
    .groupBy(df.newColumn) \
    .count() \
    .collect()
The interesting thing to note is that performing the cast works great in the filter call. Unfortunately, it doesn't appear that either withColumn or groupBy supports that kind of string API. I have tried to do
.withColumn('newColumn','cast(oldColumn as date)')
but only get yelled at for not having passed in an instance of Column:
assert isinstance(col, Column), "col should be Column"
which is the exact same problem I run into when trying to do the same thing with groupBy. Do I simply need to bite the bullet and break them up?
df = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header=True) \
    .load(path) \
    .filter("cast(filterColumn as int) in (8, 11, 12)")

counts = df.withColumn('newColumn', df.oldColumn.cast("date")) \
    .groupBy(df.newColumn) \
    .count() \
    .collect()
Upvotes: 4
Views: 6372
Reputation: 330183
You can use the col function:
from pyspark.sql.functions import col
...
.withColumn('newColumn', col('oldColumn').cast('date'))
or expr:
from pyspark.sql.functions import expr
...
.withColumn('newColumn', expr('cast(oldColumn as date)'))
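Either form slots straight into your original chain, so there is no need to break it up. As a sketch (reusing the path and column names from your example), the whole pipeline can stay a single chain; once newColumn has been created, groupBy will also accept the plain column name as a string:

from pyspark.sql.functions import col

# col('oldColumn') is resolved against the DataFrame at that point in the chain,
# so no pre-existing df reference is required.
counts = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header=True) \
    .load(path) \
    .filter("cast(filterColumn as int) in (8, 11, 12)") \
    .withColumn('newColumn', col('oldColumn').cast('date')) \
    .groupBy('newColumn') \
    .count() \
    .collect()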
Upvotes: 8