Reputation: 343
I have a multi-column PySpark dataframe, and I need to convert the string types to the correct types. For example, this is what I'm currently doing:
from pyspark.sql.functions import col

df = df.withColumn(col_name, col(col_name).cast('float')) \
    .withColumn(col_id, col(col_id).cast('int')) \
    .withColumn(col_city, col(col_city).cast('string')) \
    .withColumn(col_date, col(col_date).cast('date')) \
    .withColumn(col_code, col(col_code).cast('bigint'))
Is it possible to create a list (or dict) with the types and apply it to all the columns at once?
Upvotes: 6
Views: 3307
Reputation: 786
If you want to retain the ordering of columns in the dataframe as I did, here's an adaptation of @Alex Ott's answer:
mapping = {'col1': 'float', ....}
df = ....  # your input data

for c in df.columns:
    if c in mapping:
        df = df.withColumn(c, df[c].cast(mapping[c]))
This just iterates over the column names in the dataframe and casts only those that appear in the mapping dict.
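As a quick check, here is a minimal runnable sketch of the same loop, assuming a local SparkSession and made-up column names and values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical all-string input, only for illustration
df = spark.createDataFrame(
    [('1.5', '42', '2024-01-01')],
    ['col1', 'col2', 'col3'])

mapping = {'col1': 'float', 'col3': 'date'}
for c in df.columns:
    if c in mapping:
        df = df.withColumn(c, df[c].cast(mapping[c]))

df.printSchema()
# root
#  |-- col1: float (nullable = true)
#  |-- col2: string (nullable = true)
#  |-- col3: date (nullable = true)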
Upvotes: 1
Reputation: 87154
You just need to have the mapping as a dictionary (or something similar) and then generate the correct select statement (you can use withColumn, but chaining many of them can lead to performance problems). Something like this:
import pyspark.sql.functions as F
mapping = {'col1':'float', ....}
df = .... # your input data
rest_cols = [F.col(cl) for cl in df.columns if cl not in mapping]
conv_cols = [F.col(cl_name).cast(cl_type).alias(cl_name)
for cl_name, cl_type in mapping.items()
if cl_name in df.columns]
conv_df = df.select(*rest_cols, *conv_cols)
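If you also want to keep the original column order with a single select (combining this with the idea from the other answer), a comprehension over df.columns is one way to do it; a sketch reusing the same mapping and df names from above:

import pyspark.sql.functions as F

# cast the mapped columns, pass the rest through unchanged, preserving column order
conv_df = df.select([
    F.col(c).cast(mapping[c]).alias(c) if c in mapping else F.col(c)
    for c in df.columns
])
conv_df.printSchema()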
Upvotes: 11