Steven

Reputation: 15258

Use a pre-defined schema to read JSON in PySpark

Currently, if I want to read JSON with PySpark, I either use the inferred schema or have to define my schema manually as a StructType.

Is it possible to use a file as a reference for the schema?
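For reference, a minimal sketch of the two options I mean, assuming a SparkSession named spark and a hypothetical file data.json:

from pyspark.sql.types import StructType, StructField, StringType

# Option 1: let Spark infer the schema from the data
df = spark.read.json('data.json')

# Option 2: define the schema manually as a StructType
schema = StructType([StructField('TICKET', StringType(), True)])
df = spark.read.schema(schema).json('data.json')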

Upvotes: 0

Views: 261

Answers (1)

Alex

Reputation: 21766

You can indeed use a file to define your schema. For example, for the following schema:

TICKET:string 
TRANSFERRED:string 
ACCOUNT:integer

you can use this code to import it:

import csv
from collections import OrderedDict
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Read the name:type pairs from the schema file, preserving column order
schema = OrderedDict()
with open(r'schema.txt') as csvfile:
    schemareader = csv.reader(csvfile, delimiter=':')
    for row in schemareader:
        schema[row[0]] = row[1].strip()  # strip stray whitespace around the type name

and then you can use it to create your StructType schema on the fly:

# Map the type names from the file to PySpark type classes
mapping = {"string": StringType, "integer": IntegerType}

# Build the StructType from the parsed (name, type) pairs
schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in schema.items()])
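Once built, the schema can be passed to the JSON reader; a minimal usage sketch, assuming a SparkSession named spark and a hypothetical data file data.json:

df = spark.read.schema(schema).json('data.json')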

You may need a more complex schema file for nested JSON data. Note, however, that you can't use a JSON file to define the schema this way, because the order of the keys is not guaranteed when parsing JSON.

Upvotes: 2
