jwillis0720
jwillis0720

Reputation: 4477

How to efficiently normalize a yaml to pandas dataframe

I loaded the following yaml and I'm trying to parse it to a dataframe

data = {
    "clk": {
        "imgt": {
            "human": [
                "IGHV1-2*02",
                "IGKV1-33*01",
                "IGKJ3*01",
                "IGKJ4*01",
                "IGKJ4*02",
                "IGHJ2*01",
                "IGHJ3*02",
                "IGHJ5*02",
                "IGHD3-10*01",
                "IGHD3-16*02",
                "IGHD6-13*01",
                "IGKV1-5*03",
                "IGHJ4*02",
                "IGHD3-9*01",
                "IGLV2-11*01",
                "IGLJ1*01",
            ],
            "mouse": [
                "IGHV1-11*01",
                "IGHV1-12*01",
                "IGHV1-13*01",
                "IGHV1-14*01",
                "IGHV1-15*01",
                "IGHV1-16*01",
                "IGHV1-17-1*01",
                "IGHV1-18*01",
                "IGHV1-18*02",
                "IGHV1-18*03",
                "IGHV1-19*01",
                "IGLJ5*01",
            ],
        }
    }
}

def inefficient_normalizer(data):
    dataframe_loader = []
    for name in data:
        for source in data.get(name):
            for species in data.get(name).get(source):
                dataframe_loader.append(
                    {
                        "name": name,
                        "source": source,
                        "species": species,
                        "genes": data.get(name).get(source).get(species),
                    }
                )
    return pd.DataFrame(dataframe_loader).explode("genes").reset_index(drop=True)


>>>print(inefficient_normalizer)
name source species          genes
0   clk   imgt   human     IGHV1-2*02
1   clk   imgt   human    IGKV1-33*01
2   clk   imgt   human       IGKJ3*01
3   clk   imgt   human       IGKJ4*01
4   clk   imgt   human       IGKJ4*02
5   clk   imgt   human       IGHJ2*01
6   clk   imgt   human       IGHJ3*02
7   clk   imgt   human       IGHJ5*02
...

However, I'm wondering how I can do this with the builtin json_normalize like I see here. The difference is that I only have a key for each level and not a key pair. Right now I get

import pandas as pd
pd.json_normalize(data)
                                      clk.imgt.human                                     clk.imgt.mouse
0  [IGHV1-2*02, IGKV1-33*01, IGKJ3*01, IGKJ4*01, ...  [IGHV1-11*01, IGHV1-12*01, IGHV1-13*01, IGHV1-...

Any tips? I won't know the column names ahead of time.

Upvotes: 1

Views: 421

Answers (1)

DeepKling
DeepKling

Reputation: 154

You could convert the data from json_normalize like this:

import pandas as pd
data = {
    "clk": {
        "imgt": {
            "human": [
                "IGHV1-2*02",
                "IGKV1-33*01",
                "IGKJ3*01",
                "IGKJ4*01",
                "IGKJ4*02",
                "IGHJ2*01",
                "IGHJ3*02",
                "IGHJ5*02",
                "IGHD3-10*01",
                "IGHD3-16*02",
                "IGHD6-13*01",
                "IGKV1-5*03",
                "IGHJ4*02",
                "IGHD3-9*01",
                "IGLV2-11*01",
                "IGLJ1*01",
            ],
            "mouse": [
                "IGHV1-11*01",
                "IGHV1-12*01",
                "IGHV1-13*01",
                "IGHV1-14*01",
                "IGHV1-15*01",
                "IGHV1-16*01",
                "IGHV1-17-1*01",
                "IGHV1-18*01",
                "IGHV1-18*02",
                "IGHV1-18*03",
                "IGHV1-19*01",
                "IGLJ5*01",
            ],
        }
    }
}
df = pd.json_normalize(data)
df = df.T
df.index = df.index.str.split(".").map(tuple)
df = df.reset_index().explode(0)

I transpose the table to make the columns into rows, then I split the column names at the dot and create tuples from the resulting lists, turning it into a multilevel index. Resetting the index and using explode to create one line per entry then does the trick.

Upvotes: 1

Related Questions