Reputation: 4477
I loaded the following yaml and I'm trying to parse it to a dataframe
data = {
"clk": {
"imgt": {
"human": [
"IGHV1-2*02",
"IGKV1-33*01",
"IGKJ3*01",
"IGKJ4*01",
"IGKJ4*02",
"IGHJ2*01",
"IGHJ3*02",
"IGHJ5*02",
"IGHD3-10*01",
"IGHD3-16*02",
"IGHD6-13*01",
"IGKV1-5*03",
"IGHJ4*02",
"IGHD3-9*01",
"IGLV2-11*01",
"IGLJ1*01",
],
"mouse": [
"IGHV1-11*01",
"IGHV1-12*01",
"IGHV1-13*01",
"IGHV1-14*01",
"IGHV1-15*01",
"IGHV1-16*01",
"IGHV1-17-1*01",
"IGHV1-18*01",
"IGHV1-18*02",
"IGHV1-18*03",
"IGHV1-19*01",
"IGLJ5*01",
],
}
}
}
def inefficient_normalizer(data):
dataframe_loader = []
for name in data:
for source in data.get(name):
for species in data.get(name).get(source):
dataframe_loader.append(
{
"name": name,
"source": source,
"species": species,
"genes": data.get(name).get(source).get(species),
}
)
return pd.DataFrame(dataframe_loader).explode("genes").reset_index(drop=True)
>>>print(inefficient_normalizer)
name source species genes
0 clk imgt human IGHV1-2*02
1 clk imgt human IGKV1-33*01
2 clk imgt human IGKJ3*01
3 clk imgt human IGKJ4*01
4 clk imgt human IGKJ4*02
5 clk imgt human IGHJ2*01
6 clk imgt human IGHJ3*02
7 clk imgt human IGHJ5*02
...
However, I'm wondering how I can do this with the builtin json_normalize like I see here. The difference is that I only have a key for each level and not a key pair. Right now I get
import pandas as pd
pd.json_normalize(data)
clk.imgt.human clk.imgt.mouse
0 [IGHV1-2*02, IGKV1-33*01, IGKJ3*01, IGKJ4*01, ... [IGHV1-11*01, IGHV1-12*01, IGHV1-13*01, IGHV1-...
Any tips? I won't know the column names ahead of time.
Upvotes: 1
Views: 421
Reputation: 154
You could convert the data from json_normalize like this:
import pandas as pd
data = {
"clk": {
"imgt": {
"human": [
"IGHV1-2*02",
"IGKV1-33*01",
"IGKJ3*01",
"IGKJ4*01",
"IGKJ4*02",
"IGHJ2*01",
"IGHJ3*02",
"IGHJ5*02",
"IGHD3-10*01",
"IGHD3-16*02",
"IGHD6-13*01",
"IGKV1-5*03",
"IGHJ4*02",
"IGHD3-9*01",
"IGLV2-11*01",
"IGLJ1*01",
],
"mouse": [
"IGHV1-11*01",
"IGHV1-12*01",
"IGHV1-13*01",
"IGHV1-14*01",
"IGHV1-15*01",
"IGHV1-16*01",
"IGHV1-17-1*01",
"IGHV1-18*01",
"IGHV1-18*02",
"IGHV1-18*03",
"IGHV1-19*01",
"IGLJ5*01",
],
}
}
}
df = pd.json_normalize(data)
df = df.T
df.index = df.index.str.split(".").map(tuple)
df = df.reset_index().explode(0)
I transpose the table to make the columns into rows, then I split the column names at the dot and create tuples from the resulting lists, turning it into a multilevel index. Resetting the index and using explode to create one line per entry then does the trick.
Upvotes: 1