How to efficiently normalize a yaml to pandas dataframe

Question

I loaded the following yaml and I'm trying to parse it to a dataframe

data = {
    "clk": {
        "imgt": {
            "human": [
                "IGHV1-2*02",
                "IGKV1-33*01",
                "IGKJ3*01",
                "IGKJ4*01",
                "IGKJ4*02",
                "IGHJ2*01",
                "IGHJ3*02",
                "IGHJ5*02",
                "IGHD3-10*01",
                "IGHD3-16*02",
                "IGHD6-13*01",
                "IGKV1-5*03",
                "IGHJ4*02",
                "IGHD3-9*01",
                "IGLV2-11*01",
                "IGLJ1*01",
            ],
            "mouse": [
                "IGHV1-11*01",
                "IGHV1-12*01",
                "IGHV1-13*01",
                "IGHV1-14*01",
                "IGHV1-15*01",
                "IGHV1-16*01",
                "IGHV1-17-1*01",
                "IGHV1-18*01",
                "IGHV1-18*02",
                "IGHV1-18*03",
                "IGHV1-19*01",
                "IGLJ5*01",
            ],
        }
    }
}

def inefficient_normalizer(data):
    dataframe_loader = []
    for name in data:
        for source in data.get(name):
            for species in data.get(name).get(source):
                dataframe_loader.append(
                    {
                        "name": name,
                        "source": source,
                        "species": species,
                        "genes": data.get(name).get(source).get(species),
                    }
                )
    return pd.DataFrame(dataframe_loader).explode("genes").reset_index(drop=True)


>>>print(inefficient_normalizer)
name source species          genes
0   clk   imgt   human     IGHV1-2*02
1   clk   imgt   human    IGKV1-33*01
2   clk   imgt   human       IGKJ3*01
3   clk   imgt   human       IGKJ4*01
4   clk   imgt   human       IGKJ4*02
5   clk   imgt   human       IGHJ2*01
6   clk   imgt   human       IGHJ3*02
7   clk   imgt   human       IGHJ5*02
...

However, I'm wondering how I can do this with the builtin json_normalize like I see here. The difference is that I only have a key for each level and not a key pair. Right now I get

import pandas as pd
pd.json_normalize(data)
                                      clk.imgt.human                                     clk.imgt.mouse
0  [IGHV1-2*02, IGKV1-33*01, IGKJ3*01, IGKJ4*01, ...  [IGHV1-11*01, IGHV1-12*01, IGHV1-13*01, IGHV1-...

Any tips? I won't know the column names ahead of time.

How to efficiently normalize a yaml to pandas dataframe

Answers (1)

Related Questions