Reputation: 37
I'm currently trying to work with a JSON file with the following format:
response = {
    "leads": [{
        "id": 208827181,
        "campaignId": 2595,
        "contactId": 2919361,
        "contactAttempts": 1,
        "contactAttemptsInvalid": 0,
        "lastModifiedTime": "2017-03-14T13:37:20Z",
        "nextContactTime": "2017-03-15T14:37:20Z",
        "created": "2017-03-14T13:16:42Z",
        "updated": "2017-03-14T13:37:20Z",
        "lastContactedBy": 1271,
        "status": "automaticRedial",
        "active": True,
        "masterData": [{
                "id": 2054,
                "label": "Firmanavn",
                "value": "Firma_1"
            },
            {
                "id": 2055,
                "label": "Adresse",
                "value": "Gadenavn_1"
            },
            {
                "id": 2056,
                "label": "Postnr.",
                "value": "2000"
            },
            {
                "id": 2057,
                "label": "Bydel",
                "value": "Frederiksberg"
            },
            {
                "id": 2058,
                "label": "Telefonnummer",
                "value": "25252525"
            }
        ]
    }]
}
masterData is a nested list whose length varies from lead to lead: each entry can have a different set of columns assigned to it. I want to keep one or more specific columns for each entry, but because the nested lists differ in length, my positional indexing breaks. This is my code:
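import pandas as pd
from pandas import json_normalize

leads = json_normalize(response['leads'])
# Takes the masterData entry at position 4, which only works if every lead has the phone number there
df = pd.concat([leads.drop('masterData', axis=1),
                pd.DataFrame(list(pd.DataFrame(list(leads['masterData']))[4]))
                  .drop(['id', 'label'], axis=1)
                  .rename(columns={"value": "tlf"})], axis=1)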
The desired output is:
active campaignId contactAttempts contactAttemptsInvalid contactId created id lastContactedBy lastModifiedTime nextContactTime resultData status updated tlf
0 True 2595 1 0 2919361 2017-03-14T13:16:42Z 208827181 1271.0 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z [] automaticRedial 2017-03-14T13:37:20Z 37373737
1 True 2595 2 0 2919359 2017-03-14T13:16:42Z 208827179 1271.0 2017-03-14T13:33:30Z 2017-03-15T14:33:30Z [] privateRedial 2017-03-14T13:33:30Z 55555555
2 True 2595 1 0 2919360 2017-03-14T13:16:42Z 208827180 1271.0 2017-03-14T13:36:06Z None [] success 2017-03-14T13:36:06Z 22222222
3 True 2595 1 0 2919362 2017-03-14T13:16:42Z 208827182 1271.0 2017-03-14T13:56:39Z None [] success 2017-03-14T13:56:39Z 34343434
Where "tlf" is the added column from "masterData".
Reputation: 863361
Use json_normalize on its own, passing 'masterData' as the record path and a list of the top-level column names to keep as metadata:
from pandas import json_normalize

# Top-level fields to repeat on every masterData row
L = ['active', 'campaignId', 'contactAttempts', 'contactAttemptsInvalid',
     'contactId', 'created', 'id', 'lastContactedBy', 'lastModifiedTime',
     'nextContactTime', 'status', 'updated']
df = json_normalize(response['leads'], 'masterData', L, record_prefix='masterData.')
print(df)
masterData.id masterData.label masterData.value active campaignId \
0 2054 Firmanavn Firma_1 True 2595
1 2055 Adresse Gadenavn_1 True 2595
2 2056 Postnr. 2000 True 2595
3 2057 Bydel Frederiksberg True 2595
4 2058 Telefonnummer 25252525 True 2595
contactAttempts contactAttemptsInvalid contactId created \
0 1 0 2919361 2017-03-14T13:16:42Z
1 1 0 2919361 2017-03-14T13:16:42Z
2 1 0 2919361 2017-03-14T13:16:42Z
3 1 0 2919361 2017-03-14T13:16:42Z
4 1 0 2919361 2017-03-14T13:16:42Z
id lastContactedBy lastModifiedTime nextContactTime \
0 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
1 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
2 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
3 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
4 208827181 1271 2017-03-14T13:37:20Z 2017-03-15T14:37:20Z
status updated
0 automaticRedial 2017-03-14T13:37:20Z
1 automaticRedial 2017-03-14T13:37:20Z
2 automaticRedial 2017-03-14T13:37:20Z
3 automaticRedial 2017-03-14T13:37:20Z
4 automaticRedial 2017-03-14T13:37:20Z
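The output above has one row per masterData entry, not one row per lead as in the desired output. One possible way to get from there to a single "tlf" column is to select rows by label instead of by position. This is only a minimal sketch: it assumes the phone number is always labelled "Telefonnummer" as in the sample, it uses a shortened metadata list, the second lead is an invented record added to show a masterData list of a different length, and pd.json_normalize assumes pandas 1.0 or later.

```python
import pandas as pd

# Sample data shaped like the question's `response`. The second lead is a
# made-up record with a shorter masterData list, for illustration only.
response = {
    "leads": [
        {
            "id": 208827181,
            "campaignId": 2595,
            "status": "automaticRedial",
            "masterData": [
                {"id": 2054, "label": "Firmanavn", "value": "Firma_1"},
                {"id": 2055, "label": "Adresse", "value": "Gadenavn_1"},
                {"id": 2058, "label": "Telefonnummer", "value": "25252525"},
            ],
        },
        {
            "id": 208827179,
            "campaignId": 2595,
            "status": "privateRedial",
            "masterData": [
                {"id": 2058, "label": "Telefonnummer", "value": "55555555"},
            ],
        },
    ]
}

# Shortened metadata list; extend it with the other top-level fields as needed.
meta = ["id", "campaignId", "status"]

# One row per masterData entry, with the lead-level fields repeated.
long_df = pd.json_normalize(
    response["leads"],
    record_path="masterData",
    meta=meta,
    record_prefix="masterData.",
)

# Pick the phone rows by label, so the position inside masterData is irrelevant.
phone = long_df[long_df["masterData.label"] == "Telefonnummer"]

# Drop the helper columns and rename the value column to "tlf".
df = (
    phone.drop(columns=["masterData.id", "masterData.label"])
    .rename(columns={"masterData.value": "tlf"})
    .reset_index(drop=True)
)
print(df)
```

Note that a lead with no "Telefonnummer" entry (or an empty masterData list) would not appear in the result at all; if such leads must be kept, merging the filtered frame back onto the flat leads table with a left join is one option.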