Joe

Reputation: 13091

Pyspark - How to set the schema when reading parquet file from another DF?

I have DF1 with schema:

df1 = spark.read.parquet(load_path1)
df1.printSchema()

root
 |-- PRODUCT_OFFERING_ID: string (nullable = true)
 |-- CREATED_BY: string (nullable = true)
 |-- CREATION_DATE: string (nullable = true)

and DF2:

df2 = spark.read.parquet(load_path2)
df2.printSchema()
root
 |-- PRODUCT_OFFERING_ID: decimal(38,10) (nullable = true)
 |-- CREATED_BY: decimal(38,10) (nullable = true)
 |-- CREATION_DATE: timestamp (nullable = true)

Now I want to union these 2 DataFrames.
Sometimes the union fails because the two schemas differ.

How can I make DF2 have exactly the same schema as DF1 (at load time)?

I tried with:

df2 = spark.read.parquet(load_path2).schema(df1.schema)

Getting error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'StructType' object is not callable

Or should I CAST it instead (once DF2 is read)?

Thanks.

Upvotes: 3

Views: 10522

Answers (1)

notNull

Reputation: 31470

Move .schema() before .parquet(); Spark will then read the parquet file with the specified schema:

df2 = spark.read.schema(df1.schema).parquet(load_path2)

Upvotes: 10
