Python Refactor JSON into different JSON Structure

Question

I have a bunch of JSON data that I wrote mostly by hand, several thousand lines of it. I need to refactor it into a totally different format using Python.

An overview of my 'stuff':

Column: The basic 'unit' of my data. Each Column has attributes. Don't worry about the meaning of the attributes, but the attributes need to be retained for each Column if they exist.

Folder: Folders group Columns and other Folders together. Folders currently have no attributes; they only contain other Folder and Column objects ("object" here does not necessarily mean a JSON object, more of an 'entity').

Universe: Universes group everything into big chunks which, in the larger scope of my project, are unable to interact with each other. That is not important here, but that's what they do.

Some limitations:

Columns cannot contain other Column objects, Folder objects, or Universe objects.
Folders cannot contain Universe objects.
Universes cannot contain other Universe objects.

Currently, I have Columns in this form:

"Column0Name": {
  "type": "a type",
  "dtype": "data type",
  "description": "abcdefg"
}

and I need it to go to:

{
  "name": "Column0Name",
  "type": "a type",
  "dtype": "data type",
  "description": "abcdefg"
}

Essentially I need to convert the Column key-value things to an array of things (I am new to JSON, don't know the terminology). I also need each Folder to end up with two new JSON arrays (in addition to the "name": "FolderName" key-value pair). It needs a "folders": [] and "columns": [] to be added. So I have this for folders:

"Folder0Name": {
  "Column0Name": {
    "type": "a",
    "dtype": "b",
    "description": "c"
  },
  "Column1Name": {
    "type": "d",
    "dtype": "e",
    "description": "f"
  }
}

and need to go to this:

{
  "name": "Folder0Name",
  "folders": [],
  "columns": [
    {"name": "Column0Name", "type": "a", "dtype": "b", "description": "c"},
    {"name": "Column1Name", "type": "d", "dtype": "e", "description": "f"}
  ]
}

The folders will also end up in an array inside their parent Universe. Likewise, each Universe will end up with "name", "folders", and "columns" things. As such:

{
  "name": "Universe0",
  "folders": [a bunch of folders in a JSON array],
  "columns": [occasionally some columns in a JSON array]
}

Bottom line:

I'm going to guess that I need a recursive function to iterate through all the nested dictionaries after I import the JSON data with the json Python module.
I'm thinking some sort of usage of yield might help but I'm not super familiar yet with it.
Would it be easier to update the dicts as I go, or destroy each key-value pairs and construct an entirely new dict as I go?

Here is what I have so far. I'm stuck on getting the generator to return actual dictionaries instead of a generator object.

import json


class AllUniverses:
    """Container to hold all the Universes found in the json file"""
    def __init__(self, filename):
        self._fn = filename
        self.data = {}
        self.read_data()

    def read_data(self):
        with open(self._fn, 'r') as fin:
            self.data = json.load(fin)
        return self

    def universe_key(self):
        """Get the next universe key from the dict of all universes

            The key will be used as the name for the universe.
        """
        yield from self.data


class Universe:
    def __init__(self, json_filename):
        self._au = AllUniverses(filename=json_filename)
        self.uni_key = self._au.universe_key()
        self._universe_data = self._au.data.copy()
        self._col_attrs = ['type', 'dtype', 'description', 'aggregation']
        self._folders_list = []
        self._columns_list = []
        self._type = "Universe"
        self._name = ""
        self.uni = dict()
        self.is_folder = False
        self.is_column = False

    def output(self):
        # TODO: Pass this to json.dump?
        # TODO: Still need to get the actual folder and column dictionaries
        #  from the generators
        out = {
            "name": self._name,
            "type": "Universe",
            "folder": [f.me for f in self._folders_list],
            "columns": [c.me for c in self._columns_list]}
        return out

    def update_universe(self):
        """Get the next universe"""
        universe_k = next(self.uni_key)
        self._name = str(universe_k)
        self.uni = self._universe_data.pop(universe_k)
        return self

    def parse_nodes(self):
        """Process all child nodes"""
        nodes = [_ for _ in self.uni.keys()]
        for k in nodes:
            v = self.uni.pop(k)
            self._is_column(val=v)
            if self.is_column:
                fc = Column(data=v, key_name=k)
                self._columns_list.append(fc)
            else:
                fc = Folder(data=v, key_name=k)
                self._folders_list.append(fc)
        return self

    def _is_column(self, val):
        """Determine if val is a Column or Folder object"""
        self.is_folder = False
        self._column = False
        if isinstance(val, dict) and not val:
            self.is_folder = True
        elif not isinstance(val, dict):
            raise TypeError('Cannot handle inputs not of type dict')
        elif any([i in val.keys() for i in self._col_attrs]):
            self._column = True
        else:
            self.is_folder = True
        return self

    def parse_children(self):
        for folder in self._folders_list:
            assert(isinstance(folder, Folder)), f'bletch idk what happened'
            folder.parse_nodes()


class Folder:
    def __init__(self, data, key_name):
        self._data = data.copy()
        self._name = str(key_name)
        self._node_keys = [_ for _ in self._data.keys()]
        self._folders = []
        self._columns = []
        self._col_attrs = ['type', 'dtype', 'description', 'aggregation']

    @property
    def me(self):
        # maybe this should force the code to parse all children of this
        # Folder? Need to convert the generator into actual dictionaries
        return {"name": self._name, "type": "Folder",
                "columns": [(c.me for c in self._columns)],
                "folders": [(f.me for f in self._folders)]}

    def parse_nodes(self):
        """Parse all the children of this Folder

            Parse through all the node names. If it is detected to be a Folder
            then create a Folder obj. from it and add to the list of Folder
            objects. Else create a Column obj. from it and append to the list
            of Column obj.

            This should be appending dictionaries
        """
        for key in self._node_keys:
            _folder = False
            _column = False
            values = self._data.copy()[key]

            if isinstance(values, dict) and not values:
                _folder = True
            elif not isinstance(values, dict):
                raise TypeError('Cannot handle inputs not of type dict')
            elif any([i in values.keys() for i in self._col_attrs]):
                _column = True
            else:
                _folder = True
            if _folder:
                f = Folder(data=values, key_name=key)
                self._folders.append(f.me)
            else:
                c = Column(data=values, key_name=key)
                self._columns.append(c.me)
        return self


class Column:
    def __init__(self, data, key_name):
        self._data = data.copy()
        self._stupid_check()
        self._me = {
            'name': str(key_name),
            'type': 'Column',
            'ctype': self._data.pop('type'),
            'dtype': self._data.pop('dtype'),
            'description': self._data.pop('description'),
            'aggregation': self._data.pop('aggregation')}

    def __str__(self):
        # TODO: pretty sure this isn't correct
        return str(self.me)

    @property
    def me(self):
        return self._me

    def to_json(self):
        # This seems to be working? I think?
        return json.dumps(self, default=lambda o: str(self.me))  # o.__dict__)

    def _stupid_check(self):
        """If the key isn't in the dictionary, add it"""
        keys = [_ for _ in self._data.keys()]
        keys_defining_a_column = ['type', 'dtype', 'description', 'aggregation']
        for json_key in keys_defining_a_column:
            if json_key not in keys:
                self._data[json_key] = ""
        return self


if __name__ == "__main__":
    file = r"dummy_json_data.json"
    u = Universe(json_filename=file)
    u.update_universe()
    u.parse_nodes()
    u.parse_children()
    print('check me')

And it gives me this:

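{
    "name":"UniverseName",
    "type":"Universe",
    "folder":[
        {"name":"Folder0Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB0B0>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB190>]
        },
        {"name":"Folder2Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB040>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB120>]
        },
        {"name":"Folder4Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB270>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB200>]
        },
        {"name":"Folder6Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB2E0>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB350>]
        },
        {"name":"Folder8Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB3C0>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB430>]
        }
    ],
    "columns":[]
}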

If there is an existing tool for this kind of transformation so that I don't have to write Python code, that would be an attractive alternative, too.

Adirio · Accepted Answer

Let's create the three classes needed to represent Columns, Folders and Universes. Before starting, there are a few topics I want to mention. I describe each of them briefly here; if any of them is new to you, I can go deeper:

I will use type annotations to make clear what type each variable is.
I am going to use __slots__. By telling the Column class that its instances are going to have name, ctype, dtype, description and aggregation attributes, each instance of Column will require less memory. The downside is that it will not accept any attribute not listed there. That is, it saves memory but loses flexibility. As we are going to have many (maybe hundreds or thousands of) instances, a reduced memory footprint seems more important than the flexibility of being able to add any attribute.
Each class will have the standard constructor where every argument has a default value except name, which is mandatory.
Each class will have another constructor called from_old_syntax. It is going to be a class method that receives the string corresponding to the name and a dict corresponding to the data as its arguments and outputs the corresponding instance (Column, Folder or Universe).
Universes are basically the same as Folders with a different name (for now), so Universe will simply inherit from Folder (class Universe(Folder): pass).
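
To make the __slots__ trade-off concrete, here is a minimal sketch in isolation. The Point class is hypothetical and only serves as an illustration; it is not part of the solution below.

```python
class Point:
    # Only these two attribute names may ever be set on an instance.
    __slots__ = 'x', 'y'

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


p = Point(1, 2)
p.x = 10  # fine: 'x' is listed in __slots__

try:
    p.z = 3  # 'z' is not listed in __slots__, so this raises AttributeError
except AttributeError:
    rejected = True
else:
    rejected = False

# Slotted instances have no per-instance __dict__, which is where the
# memory saving comes from.
has_dict = hasattr(p, '__dict__')
```

Here rejected ends up True and has_dict ends up False, which is exactly the "less memory, less flexibility" trade described above.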

from typing import List


class Column:
    __slots__ = 'name', 'ctype', 'dtype', 'description', 'aggregation'

    def __init__(
        self,
        name: str,
        ctype: str = '',
        dtype: str = '',
        description: str = '',
        aggregation: str = '',
    ) -> None:
        self.name = name
        self.ctype = ctype
        self.dtype = dtype
        self.description = description
        self.aggregation = aggregation

    @classmethod
    def from_old_syntax(cls, name: str, data: dict) -> "Column":
        column = cls(name)
        for key, value in data.items():
            # The old syntax used type for column type but in the new syntax it
            # will have another meaning so we use ctype instead
            if key == 'type':
                key = 'ctype'
            try:
                setattr(column, key, value)
            except AttributeError as e:
                raise AttributeError(f"Unexpected key {key} for Column") from e
        return column


class Folder:
    __slots__ = 'name', 'folders', 'columns'

    def __init__(
        self,
        name: str,
        columns: List[Column] = None,
        folders: List["Folder"] = None,
    ) -> None:
        self.name = name
        if columns is None:
            self.columns = []
        else:
            self.columns = [column for column in columns]
        if folders is None:
            self.folders = []
        else:
            self.folders = [folder for folder in folders]

    @classmethod
    def from_old_syntax(cls, name: str, data: dict) -> "Folder":
        columns = []  # type: List[Column]
        folders = []  # type: List["Folder"]
        for key, value in data.items():
            # Determine if it is a Column or a Folder
            if 'type' in value and 'dtype' in value:
                columns.append(Column.from_old_syntax(key, value))
            else:
                folders.append(Folder.from_old_syntax(key, value))
        return cls(name, columns, folders)


class Universe(Folder):
    pass

As you can see, the constructors are pretty trivial: assign the arguments to the attributes and done. In the case of Folders (and thus Universes too), two of the arguments are lists of columns and folders. Their default value is None (in which case we initialize an empty list) because a mutable default value is created only once and then shared by every call that omits the argument, so it is good practice to use None as the default for mutable arguments (such as lists).
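
The issue with mutable defaults is easiest to see side by side. The following is a small sketch with two hypothetical helper functions (append_shared and append_fresh are names made up for this illustration):

```python
from typing import List, Optional


def append_shared(item: str, bucket: List[str] = []) -> List[str]:
    # The default list is created once, when the function is defined,
    # so every call that omits `bucket` appends to the same list.
    bucket.append(item)
    return bucket


def append_fresh(item: str, bucket: Optional[List[str]] = None) -> List[str]:
    # None as the default lets each call build its own new list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


shared_first = append_shared('a')
shared_second = append_shared('b')  # same list object as shared_first
fresh_first = append_fresh('a')
fresh_second = append_fresh('b')    # a brand new list
```

After these calls, shared_first and shared_second are the same list containing both items, while fresh_first and fresh_second are separate one-item lists. The Folder constructor uses the second pattern for the same reason.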

Column's from_old_syntax class method creates an empty Column with the provided name. Afterwards we iterate over the data dict that was also provided and assign each key-value pair to its corresponding attribute. There is a special case where the "type" key is converted to "ctype", as "type" is going to be used for a different purpose in the new syntax. The assignment itself is done by setattr(column, key, value). We have wrapped it in a try ... except ... clause because, as we said above, only the items in __slots__ can be used as attributes, so if there is an attribute that you forgot, you will get an exception saying "AttributeError: Unexpected key NAME for Column" and you will only have to add that NAME to the __slots__ (and give it a default value in __init__, so that the attribute exists on every instance).
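
As a quick check of that behaviour, here is a sketch that repeats the Column class in compact form (without annotations) so that it runs on its own. The sample data and the 'colour' key are made up for the illustration.

```python
class Column:
    __slots__ = 'name', 'ctype', 'dtype', 'description', 'aggregation'

    def __init__(self, name, ctype='', dtype='', description='', aggregation=''):
        self.name = name
        self.ctype = ctype
        self.dtype = dtype
        self.description = description
        self.aggregation = aggregation

    @classmethod
    def from_old_syntax(cls, name, data):
        column = cls(name)
        for key, value in data.items():
            if key == 'type':
                key = 'ctype'  # the old "type" becomes "ctype"
            try:
                setattr(column, key, value)
            except AttributeError as e:
                raise AttributeError(f"Unexpected key {key} for Column") from e
        return column


# A column that only has some of the known attributes.
col = Column.from_old_syntax('Column0Name', {'type': 'a', 'dtype': 'b'})

# A column with a key that is not in __slots__ (hypothetical 'colour').
try:
    Column.from_old_syntax('Bad', {'colour': 'red'})
except AttributeError as e:
    message = str(e)
```

col.ctype holds 'a', the attributes missing from the input keep their empty-string defaults, and message names the offending key.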

Folder's (and thus Universe's) from_old_syntax class method is even simpler. Create a list of columns and a list of folders, iterate over the data checking whether each item is a folder or a column, and use the appropriate from_old_syntax class method. Then use those two lists and the provided name to return the instance. Notice that Folder.from_old_syntax is used to create the nested folders instead of cls.from_old_syntax, because cls may be Universe. However, to create the instance we do use cls(...), as here we do want to get a Universe or a Folder.
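
The difference between Folder.from_old_syntax and cls(...) can be checked with a stripped-down sketch. To keep it short, this simplified Folder only tracks sub-folders and ignores columns, which is an assumption of the example and not how the full class above works.

```python
class Folder:
    __slots__ = 'name', 'folders'

    def __init__(self, name, folders=None):
        self.name = name
        self.folders = [] if folders is None else list(folders)

    @classmethod
    def from_old_syntax(cls, name, data):
        # Children are always plain Folders, whatever cls is ...
        folders = [Folder.from_old_syntax(k, v) for k, v in data.items()]
        # ... but the returned object is of the class the method was called on.
        return cls(name, folders)


class Universe(Folder):
    pass


uni = Universe.from_old_syntax('Universe0', {'Folder0Name': {'Folder1Name': {}}})
top_type = type(uni).__name__
child_type = type(uni.folders[0]).__name__
grandchild_type = type(uni.folders[0].folders[0]).__name__
```

top_type is 'Universe' while child_type and grandchild_type are both 'Folder'. Had the nested calls used cls.from_old_syntax, every nested folder would have become a Universe too.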

Now you could do universes = [Universe.from_old_syntax(name, data) for name, data in json.load(f).items()] where f is the file, and you will get all your Universes, Folders and Columns in memory. So now we need to encode them back to JSON. For this we are going to extend json.JSONEncoder so that it knows how to convert our classes into dictionaries that it can encode normally. To do so, you just need to override the default method, check whether the object passed is an instance of one of our classes, and return a dict that will be encoded. If it is not one of our classes, we let the parent's default method take care of it.

import json


# JSON fields with these values will be omitted
EMPTY_VALUES = "", [], {}


class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (Column, Folder, Universe)):
            # Make a dict with every item in their respective __slots__
            data = {
                attr: getattr(obj, attr) for attr in obj.__slots__
                if getattr(obj, attr) not in EMPTY_VALUES
            }
            # Add the type field with the class name
            data['type'] = obj.__class__.__name__
            return data

        # Use the parent class function for any object not handled explicitly
        # (it raises TypeError for objects it cannot serialize)
        return super().default(obj)

Converting the classes to dictionaries basically means taking each name in __slots__ as the key and the attribute's value as the value. We filter out values that are an empty string, an empty list or an empty dict, as we do not need to write them to JSON. Finally, we add the "type" key to the dict by reading the object's class name (Column, Folder or Universe).

To use it you have to pass the CustomEncoder as the cls argument to json.dump.
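
The same mechanism works with json.dumps, which makes it easy to try out on a small scale. The sketch below uses a hypothetical Tag class and TagEncoder as stand-ins for the real classes, just to show the cls argument in action:

```python
import json


class Tag:
    """Hypothetical stand-in for Column/Folder, used only in this sketch."""
    __slots__ = 'name', 'note'

    def __init__(self, name, note=''):
        self.name = name
        self.note = note


class TagEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Tag):
            # Skip empty strings, mirroring the EMPTY_VALUES filter above.
            data = {
                attr: getattr(obj, attr) for attr in obj.__slots__
                if getattr(obj, attr) != ''
            }
            data['type'] = obj.__class__.__name__
            return data
        # Anything else falls through to the parent, which raises TypeError.
        return super().default(obj)


text = json.dumps([Tag('a'), Tag('b', note='hi')], cls=TagEncoder)
```

The encoder only calls default for objects it does not already know how to serialize, so the list itself and the strings inside the returned dicts are handled by the standard machinery.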

So the code will look like this (omitting the class definitions to keep it short):

import json
from typing import List


# JSON fields with these values will be omitted
EMPTY_VALUES = "", [], {}


class Column:
    ...  # body as defined above


class Folder:
    ...  # body as defined above


class Universe(Folder):
    pass


class CustomEncoder(json.JSONEncoder):
    ...  # body as defined above


if __name__ == '__main__':
    with open('dummy_json_data.json', 'r') as f_in, open('output.json', 'w') as f_out:
        universes = [Universe.from_old_syntax(name, data)
                     for name, data in json.load(f_in).items()]
        json.dump(universes, f_out, cls=CustomEncoder, indent=4)
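
For comparison, the transformation can also be sketched without classes, as a single recursive function that builds new dicts, which is what the question was originally reaching for. This is an alternative to the class-based approach, not a restatement of it: it keeps the original "type" key, adds no class-name field, and always writes the "folders" and "columns" arrays, even when empty. The names convert and COLUMN_KEYS are made up for the illustration, and the rule for telling a Column from a Folder (any known column attribute present) is an assumption borrowed from the question's code.

```python
import json

COLUMN_KEYS = {'type', 'dtype', 'description', 'aggregation'}


def convert(name, data):
    """Return the new-style dict for one Universe or Folder."""
    node = {'name': name, 'folders': [], 'columns': []}
    for key, value in data.items():
        if value.keys() & COLUMN_KEYS:
            # A Column: copy its attributes and add the name.
            node['columns'].append({'name': key, **value})
        else:
            # A Folder (possibly empty): recurse into it.
            node['folders'].append(convert(key, value))
    return node


# Small in-memory sample in the old format.
old = {
    'Universe0': {
        'Folder0Name': {
            'Column0Name': {'type': 'a', 'dtype': 'b', 'description': 'c'},
            'Column1Name': {'type': 'd', 'dtype': 'e', 'description': 'f'},
        },
    },
}
new = [convert(name, data) for name, data in old.items()]
print(json.dumps(new, indent=2))
```

Building fresh dicts like this avoids mutating the input while iterating over it, at the cost of losing the validation that the __slots__-based classes provide.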
