JHixson

Reputation: 1522

Is there a way to perform a cast or withColumn dataframe operation in PySpark without breaking a function chain?

Something I enjoy about working with DataFrames is the ability to chain function calls together. The problem that I run into is that I'm struggling to find syntax that allows you to perform a cast or a withColumn operation that references a column of the DataFrame. For example:

counts = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header=True) \
    .load(path) \
    .filter("cast(filterColumn as int) in (8, 11, 12)") \
    .withColumn('newColumn', df.oldColumn.cast("date")) \  # <-- df doesn't exist, silly!
    .groupBy(df.newColumn) \
    .count() \
    .collect()

The interesting thing to note is that performing the cast works fine inside the filter call. Unfortunately, it doesn't appear that either withColumn or groupBy supports that kind of string API. I have tried to do

.withColumn('newColumn','cast(oldColumn as date)')

but I only get yelled at for not having passed in an instance of Column:

assert isinstance(col, Column), "col should be Column"

which is the exact same problem I run into when trying to do the same thing with groupBy.

Do I simply need to bite the bullet and break them up?

df = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header=True) \
    .load(path) \
    .filter("cast(filterColumn as int) in (8, 11, 12)")

counts = df.withColumn('newColumn', df.oldColumn.cast("date")) \
    .groupBy(df.newColumn) \
    .count() \
    .collect()

Upvotes: 4

Views: 6372

Answers (1)

zero323

Reputation: 330183

You can use the col function:

from pyspark.sql.functions import col

...
    .withColumn('newColumn', col('oldColumn').cast('date'))

or expr:

from pyspark.sql.functions import expr

...
    .withColumn('newColumn', expr('cast(oldColumn as date)'))
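
Either way the chain stays unbroken. Put together, the whole pipeline from the question could look something like this; a minimal sketch, assuming the same reader, path, and column names (filterColumn, oldColumn) as in the question:

from pyspark.sql.functions import col

counts = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header=True) \
    .load(path) \
    .filter("cast(filterColumn as int) in (8, 11, 12)") \
    .withColumn('newColumn', col('oldColumn').cast('date')) \
    .groupBy('newColumn') \
    .count() \
    .collect()

Note that groupBy also accepts plain column-name strings, so no reference to an intermediate df is needed there either.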

Upvotes: 8
