Reputation: 383
I'm looking for a method to change a PySpark dataframe column's type, from the current schema shown by
df.printSchema()
to a new one.
Thank you in advance for your help.
Upvotes: 4
Views: 11721
Reputation: 101
Here is a useful example where you can change the schema for every column, assuming you want the same type for all of them:
from pyspark.sql import Row
from pyspark.sql.functions import col

df = sc.parallelize([
    Row(isbn=1, count=1, average=10.6666666),
    Row(isbn=2, count=1, average=11.1111111)
]).toDF()
df.printSchema()

# Cast every column to float. Note: printSchema() returns None, so don't
# chain it onto the assignment or you lose the dataframe.
df = df.select(*[col(x).cast('float') for x in df.columns])
df.printSchema()
outputs:
root
|-- average: double (nullable = true)
|-- count: long (nullable = true)
|-- isbn: long (nullable = true)
root
|-- average: float (nullable = true)
|-- count: float (nullable = true)
|-- isbn: float (nullable = true)
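If you only need to convert a single column rather than all of them, a minimal sketch using withColumn and cast should work too (the column name average is just taken from the example above):
from pyspark.sql.functions import col

# Cast only the `average` column, leaving the others untouched
df = df.withColumn('average', col('average').cast('float'))
df.printSchema()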
Upvotes: 0
Reputation: 4291
You have to replace the column with one that has the new schema. ArrayType takes two parameters: elementType and containsNull.
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import udf

x = [("a", ["b", "c", "d", "e"]), ("g", ["h", "h", "d", "e"])]
schema = StructType([
    StructField("key", StringType(), nullable=True),
    StructField("values", ArrayType(StringType(), containsNull=False))
])
df = spark.createDataFrame(x, schema=schema)
df.printSchema()

# An identity UDF: the values pass through unchanged, but the UDF's
# declared return type carries the new schema with containsNull=True
new_schema = ArrayType(StringType(), containsNull=True)
udf_foo = udf(lambda x: x, new_schema)
df.withColumn("values", udf_foo("values")).printSchema()
root
|-- key: string (nullable = true)
|-- values: array (nullable = true)
| |-- element: string (containsNull = false)
root
|-- key: string (nullable = true)
|-- values: array (nullable = true)
| |-- element: string (containsNull = true)
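As a side note, on recent Spark versions the same nullability change can likely be done without a UDF by casting the column to the target ArrayType; a minimal sketch, reusing df and new_schema from the code above:
from pyspark.sql.functions import col

# Cast the array column to the same element type with containsNull=True
df.withColumn("values", col("values").cast(new_schema)).printSchema()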
Upvotes: 4