Noob

Reputation: 174

Pre-define datatype of dataframe while reading json

I have a JSON file with 100 columns and I want to read all of them, while specifying the datatype of two columns in advance.

I know that I could do this with the schema option:

from pyspark.sql.types import StructType, StructField, StringType

struct1 = StructType([StructField("npi", StringType(), True), StructField("NCPDP", StringType(), True)])

df = spark.read.json(path="abc.json", schema=struct1)

However, this code reads only two columns:

>>> df.printSchema()
root
 |-- npi: string (nullable = true)
 |-- NCPDP: string (nullable = true)

To use the above code I would have to give the data types of all 100 columns. How can I solve this?

Upvotes: 0

Views: 317

Answers (2)

Shaido

Reputation: 28367

You can read all the data first and then convert the two columns in question:

df = spark.read.json(path="abc.json")
df = df.withColumn("npi", df["npi"].cast("string"))\
       .withColumn("NCPDP", df["NCPDP"].cast("string"))

Upvotes: 0

Steven

Reputation: 15283

According to the official documentation, the schema can be either a StructType or a DDL-formatted string.

I can suggest 2 solutions:

1 - You use the schema of a dummy file

If you have a small file with the same schema (i.e. one line with the same structure), you can read it as a DataFrame and then reuse its inferred schema for your other JSON files:

df = spark.read.json("/path/to/dummy/file.json")
schm = df.schema
df = spark.read.json(path="abc.json", schema=schm)

2 - You generate the schema

This approach needs you to provide the column names (and possibly their types). Let's assume col is a dict with (key, value) pairs of (column name, column type).

col_list = ['{col_name} {col_type}'.format(
    col_name=col_name,
    col_type=col_type,
) for col_name, col_type in col.items()]
schema_string = ', '.join(col_list)
df = spark.read.json(path="abc.json", schema=schema_string)
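
For illustration, here is what the generated string looks like for the two columns from the question (a minimal sketch; col would list whichever columns you want in the schema):

col = {"npi": "string", "NCPDP": "string"}
# the comprehension above then yields a DDL-style string such as:
# "npi string, NCPDP string"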

Upvotes: 1
