Python Learner
Python Learner

Reputation: 23

Nested JSON to Dataframe using Python Pandas

I am new to python and trying to extract the data from JSON object to table format using Dataframe

Sample structure of JSON

{
    "subject": "Computer",
    "records": [
      {
        "name": "Section A",
        "elements": [
          {
            "type": "easy",
            "name": "Section A Module 1",
            "elements": [
              {
                "type": "text",
                "name": "SectionQ1",
                "title": "Some question?",
                "readOnly": true
              },
              {
                "type": "text",
                "name": "SectionQ2",
                "title": "some other question?",
                "isRequired": true
              }
            ]
          }
        ],
        "title": "Section A details"
      },
      {
        "name": "Section B",
        "elements": [
          {
            "type": "medium",
            "name": "Section B Module 1",
            "elements": [
              {
                "type": "radiogroup",
                "name": "SectionQ1",
                "title": "some question?",
                "choices": [
                  {
                    "value": "1",
                    "text": "Choose 1"
                  },
                  {
                    "value": "2",
                    "text": "Choose 2"
                  },
                  {
                    "value": "3",
                    "text": "Choose 3"
                  }
                ],
                "colCount": 3
              },
              {
                "type": "radiogroup",
                "name": "SectionQ2",
                "title": "some other question?",
                "description": "question 2",
                "isRequired": true,
                "choices": [
                  {
                    "value": "value 1",
                    "text": "choice 1"
                  },
                  {
                    "value": "value 2",
                    "text": "choice 2"
                  },
                  {
                    "value": "value 3",
                    "text": "choice 3"
                  }
                ]
              }
            ],
            "title": "title"
          }         
        ],
        "title": "Title"
      }
    ]
  }

I'm trying to get inner elements array as table linking to outer elements. so table could be inner_elements, outer_elements , records.

I have tried to below code along with few other approaches.

df = pd.read_json (r'sample.json')
df.columns = df.columns.map(lambda x: x.split(".")[-1])
print(df)

The expected output will be like



dataframe records

Name        Title
Section A   Section A details
Section B   Section B details


dataframe elements1

Key(records)    Type    Name
Section A       easy    Section A Module 1
Section B       medium  Section B Module 1


dataframe elements2

Key(elements1)          type        name        title                   readOnly    Description     isRequired  colCount
Section A Module 1      text        SectionQ1   Some question?          true
Section A Module 1      text        SectionQ2   some other question?    true
Section B Module 1      radiogroup  SectionQ1   some question?                                                  3
Section B Module 1      radiogroup  SectionQ2   some other question?                question 2      true


dataframe choice

Key(elements2)          type        name        value       text    
Section B Module 1      radiogroup  SectionQ1   1           Choose 1
Section B Module 1      radiogroup  SectionQ1   2           Choose 2
Section B Module 1      radiogroup  SectionQ1   3           Choose 3
Section B Module 1      radiogroup  SectionQ2   value 1     choice 1
Section B Module 1      radiogroup  SectionQ2   value 2     choice 2
Section B Module 1      radiogroup  SectionQ2   value 3     choice 3

However not getting any clue to could I proceed further. Please guide me to achieve the same

Upvotes: 2

Views: 138

Answers (1)

To extract nested arrays from a JSON object, I'd like to do something similar to MongoDB's unwind. In fact, I'd try to mimic Mongo's aggregation pipelines. The package toolz can really help with that.

First there's the explode operation, where a record containing an array field gets expanded into a sequence of records identical to the original but with the value for the key in question replaced by the corresponding array element.

Something like this:

from toolz.dicttoolz import assoc_in, get_in

def explode(d, keys):
    values = get_in(keys, d)
    if isinstance(values, list):
        for v in values:
            yield assoc_in(d, keys, v)
    else:
        yield d


record = {"x": 1, "y": [1, 2, 3]}
assert list(explode(record, ["y"])) == [{"x": 1, "y": 1}, {"x": 1, "y": 2}, {"x": 1, "y": 3}]

To create the unwind stage, the explode operation needs to be applied a sequence of records:

from toolz.itertoolz import concat

def unwind(iterator, keys):
    return concat(map(lambda d: explode(d, keys), iterator))

records = [{"x": 1, "y": [1, 2, 3]}, {"x": 2, "y": [4, 5]}]
assert list(unwind(records, ["y"])) == [
    {"x": 1, "y": 1}, 
    {"x": 1, "y": 2}, 
    {"x": 1, "y": 3},
    {"x": 2, "y": 4},
    {"x": 2, "y": 5},
]

Now you can create the pipelines to extract the tables you want:

from toolz import curried
from toolz.functoolz import pipe

sections = pipe(data["records"],
                curried.map(lambda d: {
                    "section": get_in(["name"], d),
                    "title": get_in(["title"], d),
                }),
                list,
)

modules = pipe(data["records"],
               lambda i: unwind(i, ["elements"]),
               curried.map(lambda d: {
                   "section": get_in(["name"], d),
                   "type": get_in(["elements", "type"], d),
                   "module": get_in(["elements", "name"], d),
               }),
               list,
)

questions = pipe(data["records"],
                 lambda i: unwind(i, ["elements"]),
                 lambda i: unwind(i, ["elements", "elements"]),
                 curried.map(lambda d: {
                     "module": get_in(["elements", "name"], d),
                     "type": get_in(["elements", "elements", "type"], d),
                     "question": get_in(["elements", "elements", "name"], d),
                     "title": get_in(["elements", "elements", "title"], d),
                     "readOnly": get_in(["elements", "elements", "readOnly"], d),
                     "description": get_in(["elements", "elements", "description"], d),
                     "isRequired": get_in(["elements", "elements", "isRequired"], d),
                     "colCount": get_in(["elements", "elements", "colCount"], d),
                 }),
                 list,
)

choices = pipe(data["records"],
               lambda i: unwind(i, ["elements"]),
               lambda i: unwind(i, ["elements", "elements"]),
               lambda i: unwind(i, ["elements", "elements", "choices"]),
               curried.filter(lambda d: get_in(["elements", "elements", "choices"], d) is not None),
               curried.map(lambda d: {
                   "module": get_in(["elements", "name"], d),
                   "type": get_in(["elements", "elements", "type"], d),
                   "question": get_in(["elements", "elements", "name"], d),
                   "value": get_in(["elements", "elements", "choices", "value"], d),
                   "text": get_in(["elements", "elements", "choices", "text"], d),
               }),
               list,
)

Now you can create the dataframes:

import pandas as pd

df_sections = pd.DataFrame.from_records(sections)
df_modules = pd.DataFrame.from_records(modules)
df_questions = pd.DataFrame.from_records(questions)
df_choices = pd.DataFrame.from_records(choices)

Upvotes: 1

Related Questions