Reputation: 9522
I have lists of stats produced in different runs, for each of my different samples:
d = {
"sample1": [
{"stat1": 'a', "stat2": 98}, # stats for sample1, 1st run
{"stat1": 'z', "stat2": 13}, # stats for sample1, 2nd run
],
"sample2": [
{"stat1": 'y', "stat2": 1089}, # stats for sample2, 1st run
{"stat1": 'a', "stat2": 1015}, # stats for sample2, 2nd run
],
}
And I am trying to create a DataFrame out of this so the stats are easy to manage. For example, I would like to see the average of stat2 for a given sample, or the most common stat1 value across all samples.
So df.loc["sample2"] would return all "rows" of stats for sample2, and df.loc[("sample1", 3)] would return just the 4th run. df["stat1"] would of course return the entire column for all samples and runs, and df.loc["sample1"]["stat2"] the stat2 column for sample1. I hope I got the indexing right; I am not very familiar with pandas.
I can't manage to get it right. I have tried using pd.MultiIndex, but that didn't really work:
index = pd.MultiIndex.from_tuples(???, names=['sample', 'run'])
df = pd.DataFrame(d, columns=['stat1', 'stat2'], index=index)
I have tried pairing each sample with its run numbers, like [("sample1", 0), ("sample1", 1), ("sample2", 0), ("sample2", 1)], but that didn't really work out because the number of runs won't always be the same for each sample.
Also, all values were NaN, so I must be doing something wrong when passing the data. Shouldn't passing d and the proper indices and columns be enough for the constructor to figure out how to populate the dataframe? How else should I do it then?
Upvotes: 2
Views: 1108
Reputation: 862671
I think you need concat with a dict comprehension; if you need to change the names of the MultiIndex levels, add rename_axis:
df = pd.concat({k:pd.DataFrame(v) for k, v in d.items()}).rename_axis(('sample','run'))
print(df)
            stat1  stat2
sample  run
sample1 0       a     98
        1       z     13
sample2 0       y   1089
        1       a   1015
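As a rough sketch of how the lookups from the question could be done on this frame: the snippet below rebuilds df the same way and then selects by sample, by (sample, run) pair, and by column. The third run for sample1 is a made-up row, added only to show that the samples do not need the same number of runs. The variable names are illustrative. The all-NaN result in the question is most likely because the DataFrame constructor treats the keys of d ("sample1", "sample2") as column names, and none of them match columns=['stat1', 'stat2'].

```python
import pandas as pd

d = {
    "sample1": [
        {"stat1": "a", "stat2": 98},
        {"stat1": "z", "stat2": 13},
        {"stat1": "a", "stat2": 7},  # hypothetical extra run, not in the question
    ],
    "sample2": [
        {"stat1": "y", "stat2": 1089},
        {"stat1": "a", "stat2": 1015},
    ],
}

# One small DataFrame per sample; the dict keys become the outer index level.
df = pd.concat({k: pd.DataFrame(v) for k, v in d.items()}).rename_axis(["sample", "run"])

sample2_rows = df.loc["sample2"]          # all runs of sample2
second_run = df.loc[("sample1", 1)]       # a single run, returned as a Series
sample1_stat2 = df.loc["sample1"]["stat2"]  # stat2 column for sample1

# Average of stat2 per sample, and the most common stat1 over all samples.
mean_by_sample = df.groupby(level="sample")["stat2"].mean()
most_common_stat1 = df["stat1"].mode().iloc[0]
```

Note that a single run is selected with a tuple, df.loc[("sample1", 1)]; a list such as df.loc[["sample1", 1]] would be read as a list of separate labels instead.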
Upvotes: 4