Reputation: 23
I am new to python and trying to extract the data from JSON object to table format using Dataframe
Sample structure of JSON
{
"subject": "Computer",
"records": [
{
"name": "Section A",
"elements": [
{
"type": "easy",
"name": "Section A Module 1",
"elements": [
{
"type": "text",
"name": "SectionQ1",
"title": "Some question?",
"readOnly": true
},
{
"type": "text",
"name": "SectionQ2",
"title": "some other question?",
"isRequired": true
}
]
}
],
"title": "Section A details"
},
{
"name": "Section B",
"elements": [
{
"type": "medium",
"name": "Section B Module 1",
"elements": [
{
"type": "radiogroup",
"name": "SectionQ1",
"title": "some question?",
"choices": [
{
"value": "1",
"text": "Choose 1"
},
{
"value": "2",
"text": "Choose 2"
},
{
"value": "3",
"text": "Choose 3"
}
],
"colCount": 3
},
{
"type": "radiogroup",
"name": "SectionQ2",
"title": "some other question?",
"description": "question 2",
"isRequired": true,
"choices": [
{
"value": "value 1",
"text": "choice 1"
},
{
"value": "value 2",
"text": "choice 2"
},
{
"value": "value 3",
"text": "choice 3"
}
]
}
],
"title": "title"
}
],
"title": "Title"
}
]
}
I'm trying to get inner elements
array as table linking to outer elements
.
so table could be inner_elements
, outer_elements
, records
.
I have tried to below code along with few other approaches.
df = pd.read_json (r'sample.json')
df.columns = df.columns.map(lambda x: x.split(".")[-1])
print(df)
The expected output will be like
dataframe records
Name Title
Section A Section A details
Section B Section B details
dataframe elements1
Key(records) Type Name
Section A easy Section A Module 1
Section B medium Section B Module 1
dataframe elements2
Key(elements1) type name title readOnly Description isRequired colCount
Section A Module 1 text SectionQ1 Some question? true
Section A Module 1 text SectionQ2 some other question? true
Section B Module 1 radiogroup SectionQ1 some question? 3
Section B Module 1 radiogroup SectionQ2 some other question? question 2 true
dataframe choice
Key(elements2) type name value text
Section B Module 1 radiogroup SectionQ1 1 Choose 1
Section B Module 1 radiogroup SectionQ1 2 Choose 2
Section B Module 1 radiogroup SectionQ1 3 Choose 3
Section B Module 1 radiogroup SectionQ2 value 1 choice 1
Section B Module 1 radiogroup SectionQ2 value 2 choice 2
Section B Module 1 radiogroup SectionQ2 value 3 choice 3
However not getting any clue to could I proceed further. Please guide me to achieve the same
Upvotes: 2
Views: 138
Reputation: 3001
To extract nested arrays from a JSON object, I'd like to do something similar to MongoDB's unwind
. In fact, I'd try to mimic Mongo's aggregation pipelines. The package toolz
can really help with that.
First there's the explode operation, where a record containing an array field gets expanded into a sequence of records identical to the original but with the value for the key in question replaced by the corresponding array element.
Something like this:
from toolz.dicttoolz import assoc_in, get_in
def explode(d, keys):
values = get_in(keys, d)
if isinstance(values, list):
for v in values:
yield assoc_in(d, keys, v)
else:
yield d
record = {"x": 1, "y": [1, 2, 3]}
assert list(explode(record, ["y"])) == [{"x": 1, "y": 1}, {"x": 1, "y": 2}, {"x": 1, "y": 3}]
To create the unwind
stage, the explode
operation needs to be applied a sequence of records:
from toolz.itertoolz import concat
def unwind(iterator, keys):
return concat(map(lambda d: explode(d, keys), iterator))
records = [{"x": 1, "y": [1, 2, 3]}, {"x": 2, "y": [4, 5]}]
assert list(unwind(records, ["y"])) == [
{"x": 1, "y": 1},
{"x": 1, "y": 2},
{"x": 1, "y": 3},
{"x": 2, "y": 4},
{"x": 2, "y": 5},
]
Now you can create the pipelines to extract the tables you want:
from toolz import curried
from toolz.functoolz import pipe
sections = pipe(data["records"],
curried.map(lambda d: {
"section": get_in(["name"], d),
"title": get_in(["title"], d),
}),
list,
)
modules = pipe(data["records"],
lambda i: unwind(i, ["elements"]),
curried.map(lambda d: {
"section": get_in(["name"], d),
"type": get_in(["elements", "type"], d),
"module": get_in(["elements", "name"], d),
}),
list,
)
questions = pipe(data["records"],
lambda i: unwind(i, ["elements"]),
lambda i: unwind(i, ["elements", "elements"]),
curried.map(lambda d: {
"module": get_in(["elements", "name"], d),
"type": get_in(["elements", "elements", "type"], d),
"question": get_in(["elements", "elements", "name"], d),
"title": get_in(["elements", "elements", "title"], d),
"readOnly": get_in(["elements", "elements", "readOnly"], d),
"description": get_in(["elements", "elements", "description"], d),
"isRequired": get_in(["elements", "elements", "isRequired"], d),
"colCount": get_in(["elements", "elements", "colCount"], d),
}),
list,
)
choices = pipe(data["records"],
lambda i: unwind(i, ["elements"]),
lambda i: unwind(i, ["elements", "elements"]),
lambda i: unwind(i, ["elements", "elements", "choices"]),
curried.filter(lambda d: get_in(["elements", "elements", "choices"], d) is not None),
curried.map(lambda d: {
"module": get_in(["elements", "name"], d),
"type": get_in(["elements", "elements", "type"], d),
"question": get_in(["elements", "elements", "name"], d),
"value": get_in(["elements", "elements", "choices", "value"], d),
"text": get_in(["elements", "elements", "choices", "text"], d),
}),
list,
)
Now you can create the dataframes:
import pandas as pd
df_sections = pd.DataFrame.from_records(sections)
df_modules = pd.DataFrame.from_records(modules)
df_questions = pd.DataFrame.from_records(questions)
df_choices = pd.DataFrame.from_records(choices)
Upvotes: 1