user2763088

Reputation: 383

How to change a PySpark DataFrame column's data type?

I'm looking for a method to change the type of a PySpark DataFrame column

from

df.printSchema()

[screenshot: current schema]

To

[screenshot: desired schema]

Thank you in advance for your help.

Upvotes: 4

Views: 11721

Answers (2)

Flufylobster

Reputation: 101

Here is a useful example where you can change the schema of every column, assuming you want the same type for all of them:

from pyspark.sql import Row
from pyspark.sql.functions import col

df = sc.parallelize([
    Row(isbn=1, count=1, average=10.6666666),
    Row(isbn=2, count=1, average=11.1111111)
]).toDF()

df.printSchema()
# cast every column to float; note printSchema() returns None, so assign before printing
df = df.select(*[col(x).cast('float') for x in df.columns])
df.printSchema()

outputs:

root
 |-- average: double (nullable = true)
 |-- count: long (nullable = true)
 |-- isbn: long (nullable = true)

root
 |-- average: float (nullable = true)
 |-- count: float (nullable = true)
 |-- isbn: float (nullable = true)
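
If you only need to cast a single column rather than all of them, a minimal sketch along the same lines ('count' here is just one of the example columns):

from pyspark.sql.functions import col

# cast one column in place, keeping the rest of the schema unchanged
df = df.withColumn('count', col('count').cast('float'))
df.printSchema()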

Upvotes: 0

pauli

Reputation: 4291

You have to replace the column with a new schema. ArrayType takes two parameters: elementType and containsNull.

from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import udf

x = [("a", ["b", "c", "d", "e"]), ("g", ["h", "h", "d", "e"])]
schema = StructType([StructField("key", StringType(), nullable=True),
                     StructField("values", ArrayType(StringType(), containsNull=False))])

df = spark.createDataFrame(x, schema=schema)
df.printSchema()

# identity udf whose declared return type carries the new, nullable element schema
new_schema = ArrayType(StringType(), containsNull=True)
udf_foo = udf(lambda x: x, new_schema)
df.withColumn("values", udf_foo("values")).printSchema()

outputs:

root
 |-- key: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: string (containsNull = false)

root
 |-- key: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: string (containsNull = true)
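
If you'd rather avoid the udf round trip, a plain cast should achieve the same nullability change, assuming your Spark version supports casting between compatible array types (a sketch, not part of the original answer):

from pyspark.sql.functions import col

# casting array<string> containsNull=false to containsNull=true is a widening, allowed cast
df.withColumn("values", col("values").cast(ArrayType(StringType(), containsNull=True))).printSchema()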

Upvotes: 4
