NuValue

Reputation: 463

Efficient way to transform several columns to string in PySpark

It is well documented on SO (link 1, link 2, link 3, ...) how to transform a single column to string type in PySpark:

from pyspark.sql.types import StringType    
spark_df = spark_df.withColumn('name_of_column', spark_df['name_of_column'].cast(StringType()))

However, when you have several columns that you want to transform to string type, there are several ways to achieve it:

Using for loops -- Successful approach in my code:

Trivial example:

to_str = ['age', 'weight', 'name', 'id']
for col in to_str:
  spark_df = spark_df.withColumn(col, spark_df[col].cast(StringType()))

This is a valid method, but I suspect it is not the optimal one I am looking for.

Using list comprehensions -- Not successful in my code:

My wrong example:

spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str))

Not successful, as I receive the error message:

TypeError: 'str' object is not callable

My question then is: what would be the optimal way to transform several columns to string in PySpark, based on a list of column names like to_str in my example?

Thanks in advance for your advice.

EDIT (clarification):

Thanks to @Rumoku and @pault feedback:

Both code lines are correct:

spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str)) # My initial list comprehension expression is correct.

and

spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str]) # Initial answer proposed by @Rumoku is correct.

I was receiving the error message from PySpark because I had previously renamed the object to_str to col. As @pault explains: col (the list with the desired column names) had the same name as the col function used in the list comprehension, which is why PySpark complained. Simply renaming it back to to_str and updating spark-notebook fixed everything.
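For anyone who hits the same TypeError, here is a minimal sketch of that shadowing pitfall (the column names are hypothetical). Rebinding the name col, for instance as a loop variable as in the for-loop example above, hides the imported function:

from pyspark.sql.functions import col

for col in ['age', 'weight']:  # the loop variable now shadows the imported col()
    pass

col('age')  # TypeError: 'str' object is not callable -- col is now the string 'weight'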

Upvotes: 4

Views: 14599

Answers (2)

Amit Pathak

Reputation: 1367

Not sure what col() refers to in the list-comprehension part of your solution, but anyone looking for a solution can try this -

from pyspark.sql.types import StringType 

to_str = ['age', 'weight', 'name', 'id']

spark_df = spark_df.select(
  [spark_df[c].cast(StringType()).alias(c) for c in to_str]
)

To cast all columns to string type, replace to_str with spark_df.columns.
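For example, a short sketch of that variant (assuming spark_df already exists):

from pyspark.sql.types import StringType

# Cast every column of the DataFrame to string
spark_df = spark_df.select(
  [spark_df[c].cast(StringType()).alias(c) for c in spark_df.columns]
)
spark_df.dtypes  # every column should now report 'string'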

Upvotes: 0

vvg

Reputation: 6385

It should be:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str])

Upvotes: 3
