Pyspark dataframe write and read changes schema

Question

I have a spark dataframe which contains both string and int columns.

But when I write the dataframe to a CSV file and load it later, all the columns are loaded as strings.

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                              ["Name", "count"])

Before:

df.printSchema()

Output:

root
  |-- Name: string (nullable = true)
  |-- count: long (nullable = true)


df.write.mode('overwrite').option('header', True).csv(filepath)

new_df = spark.read.option('header', True).csv(filepath)

After:

new_df.printSchema()

Output:

root
  |-- Name: string (nullable = true)
  |-- count: string (nullable = true)

How do I store the schema along with the data when writing?

notNull · Accepted Answer

A CSV file is plain text and has no place to record column types, so the schema cannot be stored when writing. Instead, specify the schema when reading.
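
To see why the types are lost, here is a minimal sketch that uses only Python's standard csv module rather than Spark (the in-memory buffer stands in for the file and is an assumption made for illustration). It writes an integer and reads it back as text:

```python
import csv
import io

# Write a header and one row whose second field is an int.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "count"])
writer.writerow(["Alberto", 2])

# Read it back: the file holds only characters, so the 2 comes back as "2".
buffer.seek(0)
rows = list(csv.reader(buffer))
header, first_row = rows[0], rows[1]
print(first_row)  # ['Alberto', '2']
```

Spark's CSV reader faces the same situation, which is why every column defaults to string unless a schema is supplied or inferred.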

Example:

from pyspark.sql.types import LongType, StringType, StructField, StructType

schema = StructType(
    [
        StructField('Name', StringType(), True),
        StructField('count', LongType(), True),
    ]
)

# Specify the schema while reading
new_df = spark.read.schema(schema).option('header', True).csv(filepath)
new_df.printSchema()

# Alternatively, let Spark infer the types with the inferSchema option.
# An explicit schema is more robust and avoids the extra pass over the data that inference needs.
new_df = spark.read.option('header', True).option('inferSchema', True).csv(filepath)
