Reputation: 463
It is well documented on SO (link 1, link 2, link 3, ...) how to cast a single column to string
type in PySpark
by analogy:
from pyspark.sql.types import StringType
spark_df = spark_df.withColumn('name_of_column', spark_df['name_of_column'].cast(StringType()))
However, when you have several columns that you want to transform to string
type, there are several ways to achieve it:
Using for
loops -- the successful approach in my code:
Trivial example:
to_str = ['age', 'weight', 'name', 'id']
for col in to_str:
    spark_df = spark_df.withColumn(col, spark_df[col].cast(StringType()))
which is a valid method, but I believe not the optimal one I am looking for.
Using list comprehensions -- Not successful in my code:
My wrong example:
spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str))
Not successful, as I receive the error message:
TypeError: 'str' object is not callable
My question then would be: what is the optimal way to transform several columns to string in PySpark, based on a list of column names like to_str
in my example?
Thanks in advance for your advice.
POSTERIOR CLARIFICATION EDIT:
Thanks to @Rumoku and @pault for their feedback:
Both code lines are correct:
spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str)) # My initial list-comprehension expression is correct.
and
spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str]) # The answer proposed by @Rumoku is correct.
I was receiving the error messages from PySpark
because I had previously renamed the object to_str
to col
. As @pault explains: col
(the list with the desired string variables) had the same name as the function col
used in the list comprehension; that's why PySpark
complained. Simply renaming col
back to to_str
and restarting spark-notebook
fixed everything.
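The shadowing problem described above is plain Python, not anything Spark-specific: once the name col is rebound to a string, col(c) attempts to call that string, producing TypeError: 'str' object is not callable. A minimal sketch without Spark (the col function below is a hypothetical stand-in for pyspark.sql.functions.col):

```python
def col(name):
    # Stand-in for pyspark.sql.functions.col, which returns a Column expression.
    return f"Column<{name}>"

print(col("age"))  # fine: `col` still refers to the function

# Rebinding the same name to a string -- as happens with `for col in to_str:`
# or by renaming the to_str list to `col` -- shadows the function.
col = "age"

try:
    col("weight")  # `col` is now a str, so calling it fails
except TypeError as exc:
    print(exc)  # 'str' object is not callable
```

This is why renaming the list back to to_str (or importing the function under an alias, e.g. from pyspark.sql.functions import col as spark_col) makes the list comprehension work again.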
Upvotes: 4
Views: 14599
Reputation: 1367
Not sure what col()
is for in the list-comprehension part of your solution, but anyone looking for a solution can try this -
from pyspark.sql.types import StringType
to_str = ['age', 'weight', 'name', 'id']
spark_df = spark_df.select(
    [spark_df[c].cast(StringType()).alias(c) for c in to_str]
)
To cast all the columns to str
type, replace to_str
with spark_df.columns
.
Upvotes: 0
Reputation: 6385
It should be:
spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str])
Upvotes: 3