HKS

Reputation: 53

How to read a Parquet file, change datatypes, and write to another Parquet file in Hadoop using PySpark

My source Parquet file stores everything as strings. My destination Parquet file needs these columns converted to different datatypes such as int, string, date, etc. How do I do this?

Upvotes: 5

Views: 20228

Answers (3)

HKS

Reputation: 53

Data file:

| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |

Script:

def createPrqtFParqt(datPath, parquetPath, inputJsonSchema, outputdfSchema):
    print("## Parsing " + datPath)
    # Read the Parquet file with the desired output schema applied on read.
    df = ssc.read.schema(outputdfSchema).parquet(datPath)
    print("## Writing " + parquetPath)
    df.write.mode("overwrite").parquet(parquetPath)

Output: An error occurred while calling Parquet. Column: Alien_Dollardiff | Expected double, found BINARY.
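Parquet files carry their own schema, so reading a column that was written as a string (BINARY) while requesting double produces exactly this error. A minimal sketch of a workaround, assuming ssc is the SparkSession and using the column names from the data file above: read the file with its stored schema and cast afterwards.

from pyspark.sql.functions import col

# Read with the schema embedded in the Parquet file (all strings here).
df = ssc.read.parquet(datPath)

# Cast each column to the desired type, then write the result out.
df = df.select(
    col("data_extract_id").cast("string"),
    col("Alien_Dollardiff").cast("double"),
    col("Alien_Dollar").cast("double"))
df.write.mode("overwrite").parquet(parquetPath)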

Upvotes: 0

Som
Som

Reputation: 6338

You may want to apply a user-defined schema to speed up data loading. There are two ways to do that:

Using a DDL-formatted string

spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")

Using a StructType schema

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
    StructField("c", DoubleType(), True)])
spark.read.schema(customSchema).parquet("test.parquet")
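Note that for Parquet the supplied schema has to be compatible with the types already stored in the file; requesting DOUBLE for a column written as a string fails with the BINARY error shown in the answer above. If unsure, inspect the stored schema first, for example:

spark.read.parquet("test.parquet").printSchema()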

Upvotes: 2

Shubham Jain

Reputation: 5536

You should read the file, typecast the columns as required, and then save the result:

from pyspark.sql.functions import col

df = spark.read.parquet('/path/to/file')
# Cast each column to the required type.
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')
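The question also mentions date columns. A hedged extension of the same approach, assuming a hypothetical col3 that holds dates as 'yyyy-MM-dd' strings:

from pyspark.sql.functions import col, to_date

df = spark.read.parquet('/path/to/file')
# Cast col1/col2 as before; parse the hypothetical date column with to_date.
df = df.select(
    col('col1').cast('int'),
    col('col2').cast('string'),
    to_date(col('col3'), 'yyyy-MM-dd').alias('col3'))
df.write.parquet('/target/path')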

Upvotes: 2
