Pyspark - casting multiple columns from Str to Int

Question

I'm attempting to cast multiple string columns to integers in a DataFrame using PySpark 2.1.0. The data set starts out as an RDD; converting it to a DataFrame raises the following error:

TypeError: StructType can not accept object 3 in type

A sample of what I'm trying to do:

import pyspark.sql.types as typ
from pyspark.sql.functions import *

labels = [
    ('A', typ.StringType()),
    ('B', typ.IntegerType()),
    ('C', typ.IntegerType()),
    ('D', typ.IntegerType()),
    ('E', typ.StringType()),
    ('F', typ.IntegerType())
]

rdd = sc.parallelize(["1", 2, 3, 4, "5", 6])
schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
df = spark.createDataFrame(rdd, schema)
df.show()

cols_to_cast = [dt[0] for dt in df.dtypes if dt[1]=='string']
#df2 = df.select(*(c.cast("integer").alias(c) for c in cols_to_cast))

df2 = df.select(*( df[dt[0]].cast("integer").alias(dt[0])
                        for dt in df.dtypes if dt[1]=='string'))

df2.show()

The first problem is that the DataFrame is not being created from the RDD. Beyond that, I have tried two ways to do the cast (df2); the first is commented out.

Any suggestions? Alternatively, is there any way I could use .withColumn to cast all the columns in one go, instead of specifying each column? The actual dataset, although not large, has many columns.

Pushkr · Accepted Answer

The problem isn't your code, it's your data. You are passing a single flat list, which Spark treats as six separate single-value records rather than one row with the six columns you want.
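To make the row-shape point concrete, here is a minimal plain-Python sketch. It is not Spark's own validation; `describe_records` and `EXPECTED_FIELDS` are hypothetical names used only for illustration. It relies on the fact that each element of the list handed to sc.parallelize becomes one record.

```python
EXPECTED_FIELDS = 6  # the schema in the question defines columns A..F


def describe_records(data):
    """Report, per record, whether it could fill a six-field row."""
    report = []
    for record in data:
        # A row must be a sequence; a bare scalar such as 3 is not one.
        is_row = isinstance(record, (list, tuple))
        width = len(record) if is_row else None
        report.append((record, is_row and width == EXPECTED_FIELDS))
    return report


flat = ["1", 2, 3, 4, "5", 6]      # six records, each a bare scalar
nested = [["1", 2, 3, 4, "5", 6]]  # one record holding six fields

flat_report = describe_records(flat)
nested_report = describe_records(nested)

print(len(flat_report), all(ok for _, ok in flat_report))      # 6 False
print(len(nested_report), all(ok for _, ok in nested_report))  # 1 True
```

The flat list yields six records that each fail the shape check, which matches the error complaining about a single object such as 3.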

Change the rdd line as below and it should work fine (notice the extra brackets around the list):

rdd = sc.parallelize([["1", 2, 3, 4, "5", 6]])

Your code with the corrected line above gives me the following output:

+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+

+---+---+
|  A|  E|
+---+---+
|  1|  5|
+---+---+
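Note that the second table contains only the cast columns A and E, because the select lists just those. As a possible way to cast every string column in one go while keeping the rest, the sketch below builds expression strings for DataFrame.selectExpr from the (name, type) pairs that df.dtypes returns. The helper name `cast_exprs` and its parameters are hypothetical, not part of the original answer, and the DataFrame call is left as a comment because it needs a live Spark session.

```python
def cast_exprs(dtypes, from_type="string", to_type="int"):
    """Build selectExpr strings that cast every `from_type` column to `to_type`.

    `dtypes` is a list of (column_name, type_name) pairs, the shape that
    DataFrame.dtypes returns. All other columns pass through unchanged.
    """
    exprs = []
    for name, type_name in dtypes:
        quoted = "`{}`".format(name)  # backticks guard unusual column names
        if type_name == from_type:
            exprs.append("CAST({0} AS {1}) AS {0}".format(quoted, to_type))
        else:
            exprs.append(quoted)
    return exprs


# dtypes as df.dtypes would report them for the schema in the question
sample_dtypes = [("A", "string"), ("B", "int"), ("C", "int"),
                 ("D", "int"), ("E", "string"), ("F", "int")]
exprs = cast_exprs(sample_dtypes)
print(exprs)

# With a live DataFrame (not executed here):
# df2 = df.selectExpr(*cast_exprs(df.dtypes))
```

Doing this in a single select keeps the column order and avoids chaining one .withColumn call per column, which matters when the dataset has many columns.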
