Reputation: 15258
Currently, if I want to read JSON with PySpark, I either use the inferred schema or I have to define my schema manually as a StructType.
Is it possible to use a file as a reference for the schema?
Upvotes: 0
Views: 261
Reputation: 21766
You can indeed use a file to define your schema. For example, given a schema file schema.txt containing:
TICKET:string
TRANSFERRED:string
ACCOUNT:integer
you can use this code to import it:
import csv
from collections import OrderedDict
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = OrderedDict()
with open(r'schema.txt') as csvfile:
    schemareader = csv.reader(csvfile, delimiter=':')
    for row in schemareader:
        # map each column name to its declared type, preserving file order
        schema[row[0]] = row[1]
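With the schema file shown above, the loop leaves schema holding each column name mapped to its type name:

# schema == OrderedDict([('TICKET', 'string'),
#                        ('TRANSFERRED', 'string'),
#                        ('ACCOUNT', 'integer')])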
and then you can use it to create your StructType schema on the fly:
# translate each declared type name into the corresponding Spark type
mapping = {"string": StringType, "integer": IntegerType}

schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in schema.items()
])
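As a quick usage sketch (not part of the original answer), assuming an active SparkSession named spark and a placeholder input path data.json, the resulting schema can be passed directly to the JSON reader:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# 'data.json' is a hypothetical path; substitute your own input file
df = spark.read.schema(schema).json('data.json')
df.printSchema()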
You may have to create a more complex schema file for a JSON input. Note, however, that you can't use a JSON file to define the schema itself, as the order of the columns is not guaranteed when parsing JSON.
Upvotes: 2