Reputation: 11
I have a question about loading a schema in CDSW using PySpark. I have a DataFrame created from a CSV file:
data_1 = spark.read.csv("demo.csv", sep=",", header=True, inferSchema=True)
The data types are inferred incorrectly for most of the variables (around 60 of them), and I can't change them all manually every time. I know what the schema should look like.
Is there any way to load the schema itself from a CSV file, so that Spark reads the dataset and applies the schema I provide instead of the inferred one?
Upvotes: 0
Views: 3416
Reputation: 6082
You can load schema.csv and build the actual schema programmatically, then use it to load the data.
Note: the type names in schema.csv must match Spark data type names.
import pandas as pd
from pyspark.sql.types import *
# schema.csv
# variable,data_type
# V1,Double
# V2,String
# V3,Double
# V4,Integer
# data.csv
# V1,V2,V3,V4
# 1.2,a,3.4,5
# read (column_name, type_name) pairs from schema.csv
dtypes = pd.read_csv('schema.csv').to_records(index=False).tolist()
# map each type name to the corresponding Spark type class, e.g. 'Double' -> DoubleType
fields = [StructField(name, globals()[f'{type_name}Type']()) for name, type_name in dtypes]
schema = StructType(fields)
df = spark.read.csv('data.csv', header=True, schema=schema)
df.printSchema()
# root
# |-- V1: double (nullable = true)
# |-- V2: string (nullable = true)
# |-- V3: double (nullable = true)
# |-- V4: integer (nullable = true)
df.show()
# +---+---+---+---+
# | V1| V2| V3| V4|
# +---+---+---+---+
# |1.2| a|3.4| 5|
# +---+---+---+---+
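If your schema file listed Spark SQL type names (DOUBLE, STRING, INT, ...) instead of the class-name prefixes used above, a shorter variant is to build a DDL-formatted string, which newer Spark versions (2.3+, as far as I know) accept directly as the schema. A minimal sketch, assuming a hypothetical schema_ddl.csv with the same variable,data_type layout:
import pandas as pd
# schema_ddl.csv (hypothetical)
# variable,data_type
# V1,DOUBLE
# V2,STRING
pairs = pd.read_csv('schema_ddl.csv')
# build e.g. 'V1 DOUBLE, V2 STRING, V3 DOUBLE, V4 INT'
ddl = ', '.join(f'{r.variable} {r.data_type}' for r in pairs.itertuples(index=False))
df = spark.read.csv('data.csv', header=True, schema=ddl)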
Upvotes: 0
Reputation: 527
Read with a custom schema so that you can define exactly which data types you want.
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField("COL1", StringType(), True),
    StructField("COL2", DecimalType(20, 10), True),
    StructField("COL3", DecimalType(20, 10), True)
])
df = spark.read.schema(schema).csv(file_path)
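Assuming file_path points at a CSV whose columns line up with that schema, you can check that the types were applied rather than inferred; it should print something like:
df.printSchema()
# root
#  |-- COL1: string (nullable = true)
#  |-- COL2: decimal(20,10) (nullable = true)
#  |-- COL3: decimal(20,10) (nullable = true)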
Upvotes: 1