Is it possible to cast multiple columns of a dataframe in pyspark?

I have a multi-column pyspark dataframe, and I need to convert the string columns to the correct types, for example to float, int, date, etc.

This is what I'm currently doing:

from pyspark.sql.functions import col

df = df.withColumn(col_name, col(col_name).cast('float')) \
    .withColumn(col_id, col(col_id).cast('int')) \
    .withColumn(col_city, col(col_city).cast('string')) \
    .withColumn(col_date, col(col_date).cast('date')) \
    .withColumn(col_code, col(col_code).cast('bigint'))

Is it possible to create a dict of column names and types and apply all the casts at once?

Upvotes: 6

Views: 3307

Answers (2)

teejay

Reputation: 786

If you want to retain the ordering of columns in the dataframe as I did, here's an adaptation of @Alex Ott's answer:

mapping = {'col1': 'float', ....}
df = .... # your input data
for c in df.columns:
    if c in mapping:
        df = df.withColumn(c, df[c].cast(mapping[c]))

This basically just iterates over the column names in the dataframe, performing a cast only for those columns that appear in the mapping dict.
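For illustration, here is a minimal self-contained sketch of that loop on a made-up dataframe (the column names and types are assumptions, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical input: every column arrives as a string
df = spark.createDataFrame(
    [("1.5", "42", "2021-03-01")],
    ["price", "qty", "sale_date"],
)

mapping = {"price": "float", "qty": "int", "sale_date": "date"}
for c in df.columns:
    if c in mapping:
        df = df.withColumn(c, df[c].cast(mapping[c]))

df.printSchema()
# root
#  |-- price: float (nullable = true)
#  |-- qty: integer (nullable = true)
#  |-- sale_date: date (nullable = true)

Because each column is rewritten in place, the original column order is preserved.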

Upvotes: 1

Alex Ott

Reputation: 87154

You just need to have a mapping as a dictionary (or something similar), and then generate the correct select statement (you can use withColumn, but chaining many of them can lead to performance problems). Something like this:

import pyspark.sql.functions as F
mapping = {'col1':'float', ....}
df = .... # your input data
rest_cols = [F.col(cl) for cl in df.columns if cl not in mapping]
conv_cols = [F.col(cl_name).cast(cl_type).alias(cl_name) 
   for cl_name, cl_type in mapping.items()
   if cl_name in df.columns]
conv_df = df.select(*rest_cols, *conv_cols)
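For illustration, a minimal self-contained sketch of this select-based approach on a made-up dataframe (column names and types are assumptions):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical input: all columns arrive as strings
df = spark.createDataFrame(
    [("1.5", "42", "NYC")],
    ["price", "qty", "city"],
)

mapping = {"price": "float", "qty": "int"}
rest_cols = [F.col(cl) for cl in df.columns if cl not in mapping]
conv_cols = [F.col(cl_name).cast(cl_type).alias(cl_name)
             for cl_name, cl_type in mapping.items()
             if cl_name in df.columns]
conv_df = df.select(*rest_cols, *conv_cols)

conv_df.printSchema()
# root
#  |-- city: string (nullable = true)
#  |-- price: float (nullable = true)
#  |-- qty: integer (nullable = true)

Note that the untouched columns come first in the result, followed by the cast ones; that reordering is the caveat the other answer's loop avoids.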

Upvotes: 11
