Reputation: 174
I have one JSON file with 100 columns and I want to read all the columns, with a predefined datatype for two of them.
I know that I could do this with the schema option:
from pyspark.sql.types import StructType, StructField, StringType

struct1 = StructType([StructField("npi", StringType(), True), StructField("NCPDP", StringType(), True)])
df = spark.read.json(path="abc.json", schema=struct1)
However, this code reads only two columns:
>>> df.printSchema()
root
|-- npi: string (nullable = true)
|-- NCPDP: string (nullable = true)
To use the above code I would have to give the data types of all 100 columns. How can I solve this?
Upvotes: 0
Views: 317
Reputation: 28367
You can read all the data first and then convert the two columns in question:
df = spark.read.json(path="abc.json")
df = df.withColumn("npi", df["npi"].cast("string"))\
       .withColumn("NCPDP", df["NCPDP"].cast("string"))
Upvotes: 0
Reputation: 15283
According to the official documentation, schema can be either a StructType or a String.
I can suggest 2 solutions:
First solution: if you have one light file with the same schema (i.e. one line with the same structure), you can read it as a DataFrame and then reuse its schema for your other JSON files:
df = spark.read.json("/path/to/dummy/file.json")
schm = df.schema
df = spark.read.json(path="abc.json", schema=schm)
The second solution needs you to provide the column names (and maybe their types too).
Let's assume col is a dict with (key, value) as (column name, column type).
col_list = ['{col_name} {col_type}'.format(
    col_name=col_name,
    col_type=col_type,
) for col_name, col_type in col.items()]
schema_string = ', '.join(col_list)
df = spark.read.json(path="abc.json", schema=schema_string)
Upvotes: 1