Reputation: 77
I searched around for some of the similarly worded questions, but I didn't find a solution, so I am making my own question.
I am using Python. I am trying to create a DataFrame from a nested dictionary (with lists), and I can't get the length of the columns correct.
My data structure is:
[
{
"Info":{
"StudentID":1000009,
"firstName":"Name",
"middleName":"Middle",
"lastName":"Last"
},
"Pass":[
{
"Subject":"Math",
"Score":98
},
{
"Subject":"Science",
"Score":100
}
]
},
{
"Info":{
"StudentID":1000010,
"firstName":"Name",
"middleName":"Middle",
"lastName":"Last"
},
"Pass":[
{
"Subject":"Math",
"Score":90
},
{
"Subject":"Science",
"Score":82
},
{
"Subject":"English",
"Score":99
}
]
}
]
I want the DataFrame to be:
ID firstName middleName lastName Subject1 Subject2 Subject3
0 1000009 Name Middle Last Math Science NaN
0 1000009 Name Middle Last Math Science English
The data is more complex (more subjects), but what ends up happening is that I get an error that all arrays must be the same length:
ValueError: All arrays must be of the same length
I tried:
p = defaultdict(list)
num = len(data)
for i in range(0,num):
crd = data[i]['Info']['individualId']
p['crd'].append(crd)
firstName = data[i]['Info']['firstName']
p['firstName'].append(firstName)
middleName = data[i]['Info']['middleName ']
p['firstName'].append(middleName)
lastName = data[i]['Info']['lastName']
p['lastName'].append(lastName)
subjects= len(data[i]['Pass'])
for e in range(0,subjects):
try:
exam = data[i]['Pass'][e]['Subject']
p[f'Subject{e}'].append(subject)
except:
break
df = pd.DataFrame.from_dict(p, orient='index')
df
But the data isn't aligned - it must be tied to the correct ID. Instead, it lists the info in the order that it appears. In other words, there isn't a missing value in Subject3.
I've also tried creating a list, then creating a new list within the for loop.
rec = []
num = len(data)
for i in range(0,num):
p = []
crd = data[i]['Info']['individualId']
p.append(crd)
firstName = data[i]['Info']['firstName']
p.append(firstName)
middleName = data[i]['Info']['middleName ']
p.append(middleName)
lastName = data[i]['Info']['lastName']
p.append(lastName)
subjects= len(data[i]['Pass'])
for e in range(0,subjects):
try:
exam = data[i]['Pass'][e]['Subject']
p.append(subject)
except:
break
rec.append(p)
df = pd.DataFrame(rec)
df
In this code, I get misaligned info. Each column doesn't have the standard info. For example, if someone doesn't have a middle name in the data, everything will get shifted to the left.
Any solutions?
Upvotes: 0
Views: 527
Reputation: 150735
For your data, you can use json_normalize
, then a bit of manipulation:
# put data into a frame
tmp = pd.json_normalize(data)
passes = tmp['Pass'].explode()
out = tmp.drop(columns='Pass').join(pd.DataFrame(passes.tolist(), index=passes.index))
Then out
is:
Info.StudentID Info.firstName Info.middleName Info.lastName Subject Score
0 1000009 Name Middle Last Math 98
0 1000009 Name Middle Last Science 100
1 1000010 Name Middle Last Math 90
1 1000010 Name Middle Last Science 82
1 1000010 Name Middle Last English 99
From here, you can pivot the Subject
Upvotes: 1