Reputation: 53
My source Parquet file has every column stored as a string. The destination Parquet file needs these columns converted to different data types, such as int, string, date, etc. How do I do this?
Upvotes: 5
Views: 20228
Reputation: 53
Data file:
| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |
Script:
def createPrqtFParqt(datPath, parquetPath, inpustJsonSchema, outputdfSchema):
    print("## Parsing " + datPath)
    # Read the source Parquet file, forcing the target (output) schema on read
    df = ssc.read.schema(outputdfSchema).parquet(datPath)
    print("## Writing " + parquetPath)
    df.write.mode("overwrite").parquet(parquetPath)
Output: An error occurred while calling Parquet. Column: Alien_Dollardiff | Expected double, Found BINARY.
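The error comes from the fact that Parquet is self-describing: each column carries a physical type in the file, and spark.read.schema() does not cast on read, it only declares what the types are expected to be. Because the source columns were written as strings (BINARY in Parquet terms), forcing a double schema fails. A quick way to confirm what the file actually stores (a minimal sketch; the path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let Spark pick up the schema embedded in the file instead of forcing one
df = spark.read.parquet("/path/to/source.parquet")
df.printSchema()  # columns written as strings will show up as 'string'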
Upvotes: 0
Reputation: 6338
You may want to apply a user-defined schema to speed up data loading. There are two ways to do that: with a DDL-formatted string, or with a StructType.
spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
    StructField("c", DoubleType(), True)])

spark.read.schema(customSchema).parquet("test.parquet")
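Note that for Parquet a user-defined schema mainly skips schema inference from the file footers; it does not convert the stored types. If a supplied type disagrees with the physical type in the file (for example double for a column stored as string/BINARY), the read fails with exactly the kind of error shown in the question. A hedged sketch of the matching pattern, assuming the source columns are all strings as in the question:

from pyspark.sql.types import StructType, StructField, StringType

# Declare the columns exactly as they are physically stored (all strings)
storedSchema = StructType([
    StructField("data_extract_id", StringType(), True),
    StructField("Alien_Dollardiff", StringType(), True),
    StructField("Alien_Dollar", StringType(), True)])

df = spark.read.schema(storedSchema).parquet("test.parquet")
# Cast afterwards to get the target type
df = df.withColumn("Alien_Dollardiff", df["Alien_Dollardiff"].cast("double"))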
Upvotes: 2
Reputation: 5536
You should read the file, typecast the columns as required, and then save the result:
from pyspark.sql.functions import col

df = spark.read.parquet('/path/to/file')
# Cast each column to the required type; select keeps only the listed columns
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')
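The question also mentions dates: string columns holding dates convert the same way, and to_date with an explicit format is safer than a bare cast('date') when the strings are not in the default yyyy-MM-dd form. A minimal sketch (the column name 'col3' and the format are assumptions):

from pyspark.sql.functions import col, to_date

# Hypothetical date column stored as 'dd/MM/yyyy' strings
df = df.withColumn('col3', to_date(col('col3'), 'dd/MM/yyyy'))
df.write.mode('overwrite').parquet('/target/path')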
Upvotes: 2