Nam Nguyễn Văn

Reputation: 1

Converting Nested JSON to DataFrame

I am trying to convert nested JSON to a DataFrame with spark.read.option("multiline", "true").json(file_path), but this code orders the column names alphabetically, which is not what I expected.

When I use spark.read.option("multiline", "true").json, the DataFrame comes back with this column order: CreatedAt, CreatedBy, IsDeleted, ModifiedAt, ModifiedBy, TypeName, id

The column order I expect is the order in the .json file: Id, TypeName, CreatedBy, CreatedAt, ModifiedBy, ModifiedAt, IsDeleted

How can I read the nested JSON with multiline and get the same column order as in the JSON file? PS: I don't want to define the schema manually (I want the schema to be dynamic).

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Read JSON to DataFrame") \
    .getOrCreate()

# Read JSON file into DataFrame to infer schema
json_df = spark.read.option("multiline", "true").json(FILE_PATH)

# Keep only the 'data' component of the JSON file
exploded_df = json_df.select(explode("data").alias("data"))

# Select the fields from the exploded DataFrame
data_df = exploded_df.select("data.*")
data_df.show()


Upvotes: 0

Views: 45

Answers (1)

Nishu Tayal

Reputation: 20840

By default, Spark sorts columns alphabetically when inferring the schema from a JSON file. There is no spark.read.option(...) setting that preserves the key order from the file.

If you want to retain the column order, you can either:

  • Specify the schema while reading the file with spark.read.schema(schema).json(..)
  • Use the select() function to define the column order as a follow-up step

Upvotes: 0
