HKS

Reputation: 53

How to read a Parquet file, change datatypes, and write to another Parquet file in Hadoop using PySpark

My source Parquet file stores everything as strings. My destination Parquet file needs these columns converted to different datatypes such as int, string, date, etc. How do I do this?

Upvotes: 5

Views: 20228

Answers (3)

HKS

Reputation: 53

Data file:

| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |

Script:

def createPrqtFParqt(datPath, parquetPath, inputJsonSchema, outputdfSchema):
    print("## Parsing " + datPath)
    # Read the Parquet file with the desired output schema applied on read.
    df = ssc.read.schema(outputdfSchema).parquet(datPath)
    print("## Writing " + parquetPath)
    df.write.mode("overwrite").parquet(parquetPath)

Output: An error occurred while calling Parquet. Column: Alien_Dollardiff | Expected double, found BINARY.
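Parquet files carry their own schema, so reading a column that was written as a string (BINARY) while requesting double produces exactly this error. A minimal sketch of a workaround, assuming ssc is the SparkSession and using the column names from the data file above: read the file with its stored schema and cast afterwards.

from pyspark.sql.functions import col

# Read with the schema embedded in the Parquet file (all strings here).
df = ssc.read.parquet(datPath)

# Cast each column to the desired type, then write the result out.
df = df.select(
    col("data_extract_id").cast("string"),
    col("Alien_Dollardiff").cast("double"),
    col("Alien_Dollar").cast("double"))
df.write.mode("overwrite").parquet(parquetPath)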

Upvotes: 0

Som
Som

Reputation: 6338

You may want to apply a user-defined schema to speed up data loading. There are two ways to do that:

Using a DDL-formatted string

spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")

Using a StructType schema

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
    StructField("c", DoubleType(), True)])
spark.read.schema(customSchema).parquet("test.parquet")
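Note that for Parquet the supplied schema has to be compatible with the types already stored in the file; requesting DOUBLE for a column written as a string fails with the BINARY error shown in the answer above. If unsure, inspect the stored schema first, for example:

spark.read.parquet("test.parquet").printSchema()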

Upvotes: 2

Shubham Jain

Reputation: 5536

You should read the file, typecast the columns as required, and then save the result:

from pyspark.sql.functions import col

df = spark.read.parquet('/path/to/file')
# Cast each column to the required type.
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')
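The question also mentions date columns. A hedged extension of the same approach, assuming a hypothetical col3 that holds dates as 'yyyy-MM-dd' strings:

from pyspark.sql.functions import col, to_date

df = spark.read.parquet('/path/to/file')
# Cast col1/col2 as before; parse the hypothetical date column with to_date.
df = df.select(
    col('col1').cast('int'),
    col('col2').cast('string'),
    to_date(col('col3'), 'yyyy-MM-dd').alias('col3'))
df.write.parquet('/target/path')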

Upvotes: 2
