Noob

Reputation: 174

Pre-define datatype of dataframe while reading json

I have a JSON file with 100 columns and I want to read all of them, while specifying the datatype of two columns in advance.

I know that I could do this with the schema option:

from pyspark.sql.types import StructType, StructField, StringType

struct1 = StructType([StructField("npi", StringType(), True), StructField("NCPDP", StringType(), True)])

df = spark.read.json(path="abc.json", schema=struct1)

However, this code reads only two columns:

>>> df.printSchema()
root
 |-- npi: string (nullable = true)
 |-- NCPDP: string (nullable = true)

To use the above code I would have to give the data types of all 100 columns. How can I solve this?

Upvotes: 0

Views: 317

Answers (2)

Shaido

Reputation: 28367

You can read all the data first and then convert the two columns in question:

df = spark.read.json(path="abc.json")
df = df.withColumn("npi", df["npi"].cast("string"))\
       .withColumn("NCPDP", df["NCPDP"].cast("string"))

Upvotes: 0

Steven

Reputation: 15283

According to the official documentation, the schema can be either a StructType or a DDL-formatted string.

I can suggest 2 solutions:

1 - You use the schema of a dummy file

If you have a small file with the same schema (i.e. one line with the same structure), you can read it as a DataFrame and then reuse its inferred schema for your other JSON files:

df = spark.read.json("/path/to/dummy/file.json")
schm = df.schema
df = spark.read.json(path="abc.json", schema=schm)

2 - You generate the schema

This approach needs you to provide the column names (and possibly their types). Let's assume col is a dict with (key, value) pairs of (column name, column type).

col_list = ['{col_name} {col_type}'.format(
    col_name=col_name,
    col_type=col_type,
) for col_name, col_type in col.items()]
schema_string = ', '.join(col_list)
df = spark.read.json(path="abc.json", schema=schema_string)
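
For illustration, here is what the generated string looks like for the two columns from the question (a minimal sketch; col would list whichever columns you want in the schema):

col = {"npi": "string", "NCPDP": "string"}
# the comprehension above then yields a DDL-style string such as:
# "npi string, NCPDP string"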

Upvotes: 1
