Penny T

Reputation: 11

Uploading custom schema from a csv file using pyspark

I have a question about loading a schema onto CDSW using PySpark. I have a DataFrame which is created from a CSV file:

data_1 = spark.read.csv("demo.csv", sep=",", header=True, inferSchema=True)

The data types are inferred wrong for most of the variables, i.e. around 60 of them, and I can't change them all manually every time. I know what the schema should look like.

Is there any way I could load the schema from a CSV file as well? That is, Spark would read the dataset and apply the schema I upload, overriding the inferred one.

Upvotes: 0

Views: 3416

Answers (2)

pltc

Reputation: 6082

You can load schema.csv, build an actual schema programmatically, and then use it to load the actual data.

Note: the type names in schema.csv must match Spark data type names (e.g. Double maps to DoubleType).

import pandas as pd
from pyspark.sql.types import *

# schema.csv
# variable,data_type
# V1,Double
# V2,String
# V3,Double
# V4,Integer

# data.csv
# V1,V2,V3,V4
# 1.2,a,3.4,5

dtypes = pd.read_csv('schema.csv').to_records(index=False).tolist()
# Look up the matching Spark type class for each (name, type) pair from schema.csv
fields = [StructField(name, globals()[f'{dtype}Type']()) for name, dtype in dtypes]
schema = StructType(fields)

df = spark.read.csv('data.csv', header=True, schema=schema)

df.printSchema()
# root
#  |-- V1: double (nullable = true)
#  |-- V2: string (nullable = true)
#  |-- V3: double (nullable = true)
#  |-- V4: integer (nullable = true)

df.show()
# +---+---+---+---+
# | V1| V2| V3| V4|
# +---+---+---+---+
# |1.2|  a|3.4|  5|
# +---+---+---+---+
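
As a variation (not part of the original answer), you can avoid relying on globals() by keeping an explicit lookup table from the type names in schema.csv to the Spark type classes. A minimal sketch, assuming the same schema.csv layout and an existing SparkSession named spark (TYPE_MAP is just an illustrative name):

import pandas as pd
from pyspark.sql.types import (StructType, StructField,
                               DoubleType, StringType, IntegerType)

# Explicit mapping: type name used in schema.csv -> Spark type class
TYPE_MAP = {
    'Double': DoubleType,
    'String': StringType,
    'Integer': IntegerType,
}

dtypes = pd.read_csv('schema.csv').to_records(index=False).tolist()
fields = [StructField(name, TYPE_MAP[type_name]()) for name, type_name in dtypes]
schema = StructType(fields)

df = spark.read.csv('data.csv', header=True, schema=schema)

This keeps the mapping visible in one place and fails with a clear KeyError if schema.csv contains a type name you have not mapped.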

Upvotes: 0

Rafa

Reputation: 527

Read the file with a custom schema so that you can define exactly which data type you want for each column.

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField("COl1", StringType(), True),
    StructField("COL2", DecimalType(20, 10), True),
    StructField("COL3", DecimalType(20, 10), True)
])

df = spark.read.schema(schema).csv(file_path)
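
As a side note (not from the original answer), .schema() also accepts a DDL-formatted string in Spark 2.3+, which can be shorter when there are only a few columns. A minimal sketch using the same column names and the same file_path placeholder, assuming an existing SparkSession named spark:

# Same schema expressed as a DDL string instead of a StructType
ddl = "COl1 STRING, COL2 DECIMAL(20,10), COL3 DECIMAL(20,10)"
df = spark.read.schema(ddl).csv(file_path)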

Upvotes: 1
