Reputation: 11
I am new to Spark and am trying to read a JSON file in the format below into a Spark DataFrame. This is the format of my JSON:
"elements": [
Q4
{
Name:ABC,
Language:English,
Age:45,
Title:SWE
},
Q5
{
Name:DEF,
Language:English,
Age:60
Title: Engineer
},
Q6
{
Name:HIJ,
Language:English,
Age:57,
Title:
}
]
I want the output to be:
Name | Language | Age | Title
ABC | English | 45 | SWE
DEF | English | 60 | Engineer
HIJ | English | 57 | Null
How do I achieve this with PySpark?
Upvotes: 1
Views: 350
Reputation: 830
Please try using
df = spark.read.json(path)
to read the file, where path is the location of your JSON file. It will convert your data into a DataFrame. You may need to select the nested JSON element if you need the document inside that element.
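As a rough sketch of reading the file and picking a nested element: by default `spark.read.json` expects one JSON object per line, so a pretty-printed file needs `multiLine=True`. The sketch assumes the file is valid JSON (quoted keys and strings) with `elements` as an object keyed by `Q4`, `Q5`, and so on, which the snippet in the question is not as written. The sample data, the file name, and the helper names `write_sample` and `read_element` are hypothetical.

```python
import json
import os
import tempfile

# Assumed valid-JSON version of the question's data (hypothetical sample).
SAMPLE = {
    "elements": {
        "Q4": {"Name": "ABC", "Language": "English", "Age": 45, "Title": "SWE"},
        "Q5": {"Name": "DEF", "Language": "English", "Age": 60, "Title": "Engineer"},
    }
}


def write_sample(directory):
    """Write SAMPLE as a pretty-printed (multi-line) JSON file; return its path."""
    path = os.path.join(directory, "elements.json")
    with open(path, "w") as fh:
        json.dump(SAMPLE, fh, indent=2)
    return path


def read_element(spark, path, key):
    """Read a multi-line JSON file and expand one nested element into columns.

    `spark` is an existing SparkSession; `key` is an element name such as "Q4".
    """
    # multiLine=True is needed because the JSON object spans several lines.
    df = spark.read.json(path, multiLine=True)
    # "elements.<key>.*" expands the fields of that nested struct into columns.
    return df.select(f"elements.{key}.*")


sample_path = write_sample(tempfile.mkdtemp())
```

With a running session, `read_element(spark, sample_path, "Q4").show()` should print a single row holding the fields of `Q4`.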
--Edit: If you want to use a hard-coded string, please refer to the Spark documentation. Example from the Spark documentation:
sc = spark.sparkContext
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
# +---------------+----+
# |        address|name|
# +---------------+----+
# |[Columbus,Ohio]| Yin|
# +---------------+----+
--Edit 2: Here is a version based on your example, using only the data needed to create the DataFrame. I hope this works for you.
import json
from pyspark.sql import SparkSession
from pyspark.sql import Row
spark = SparkSession.builder.master("local").getOrCreate()
json_doc1='{"elements": {"Q4":{"Name":"ABC","Language":"English","Age":45,"Title":"SWE"},"Q5": {"Name":"DEF","Language":"English","Age":60,"Title": "Engineer"}}}'
# Parse the JSON string and keep only the objects stored under "elements"
test=json.loads(json_doc1)
data1=list(test['elements'].values())
print(data1)
# Build one Row per element and create the DataFrame from the list of Rows
df1=spark.createDataFrame([Row(**x) for x in data1])
df1.show()
+---+--------+----+--------+
|Age|Language|Name|   Title|
+---+--------+----+--------+
| 60| English| DEF|Engineer|
| 45| English| ABC|     SWE|
+---+--------+----+--------+
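The example above leaves out Q6, whose Title is empty. One possible way to cover it, sketched here, is to build the rows in an explicit column order and map missing or empty values to None, which Spark displays as null. This assumes the same object-of-objects layout as `json_doc1`, with Q6 added as valid JSON. The names `COLUMNS`, `elements_to_rows`, and `to_dataframe` are hypothetical, and treating an empty string as missing is an assumption.

```python
import json

# Column order requested in the question.
COLUMNS = ["Name", "Language", "Age", "Title"]


def elements_to_rows(json_text, columns=COLUMNS):
    """Return one tuple per element, in `columns` order, with None for gaps."""
    elements = json.loads(json_text)["elements"]
    rows = []
    for record in elements.values():
        values = []
        for col in columns:
            value = record.get(col)  # None when the key is absent
            # Assumption: an empty string also counts as a missing value.
            values.append(None if value == "" else value)
        rows.append(tuple(values))
    return rows


def to_dataframe(spark, rows, columns=COLUMNS):
    """Build a DataFrame from the rows; `spark` is an existing SparkSession."""
    return spark.createDataFrame(rows, schema=columns)


# Assumed valid-JSON form of the question's data, including Q6.
doc = (
    '{"elements": {'
    '"Q4": {"Name": "ABC", "Language": "English", "Age": 45, "Title": "SWE"},'
    '"Q5": {"Name": "DEF", "Language": "English", "Age": 60, "Title": "Engineer"},'
    '"Q6": {"Name": "HIJ", "Language": "English", "Age": 57, "Title": ""}}}'
)
rows = elements_to_rows(doc)
```

With a running session, `to_dataframe(spark, rows).show()` should list the three people in the Name, Language, Age, Title order, with a null Title for HIJ.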
Thanks, Manu
Upvotes: 2