Reputation: 30
I have a set of 3 JSON files which have an identical layout. The number is expected to be much larger when I push the code into production; I only use 3 to keep the workflow fast.
The JSON files are structured like this:
{
    "results": [
        {
            "engagement": {
                "id": 2342,
                "portalId": 23423,
                "active": true,
                "createdAt": 1661855667536,
                "lastUpdated": 1661935264761,
                "modifiedBy": 3453
            },
            "associations": {
                "contactIds": [
                    00000
                ],
                "companyIds": [],
                "dealIds": []
            },
            "attachments": [],
            "scheduledTasks": [],
            "metadata": {
                "status": "COMPLETED",
                "forObjectType": "CONTACT",
                "subject": "DEMO"
            }
        }
    ],
    "hasMore": true,
    "offset": 520,
    "total": 10523
}
Each file's "results" array can hold up to 250 records, each starting with an "engagement" object.
I am trying to find a way to merge all 3 JSON files with Python, keeping only the data in "results" and dropping the rest.
So far I can either add all 3 JSONs together, but they stay separated under their own "results" headers, or the last JSON overwrites the previously written file, so I am no further.
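For illustration, the overwriting version looked roughly like this (a simplified sketch, not my exact code):
import json

# anti-pattern (hypothetical sketch): opening result.json in "w" mode
# inside the loop truncates it on every iteration, so only the last
# file's records survive
for filename in ["file1.json", "file2.json", "file3.json"]:
    with open(filename) as infile:
        data = json.load(infile)
    with open("result.json", "w") as outfile:  # reopened and truncated each time
        json.dump(data["results"], outfile)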
The expected result would look like this:
[
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    }
],
[
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    }
],
[
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    }
]
Any help will be much appreciated.
Upvotes: 0
Views: 342
Reputation: 4391
This is relatively simple, but I would restructure the resulting JSON a bit, as the current structure does not make much sense.
What the code below does is simply load the files and add all the list elements from each file's "results" dict to the final_result list. You end up with a single list where each element is one of the records you needed from the original JSON files. Then it saves that list to a new file.
import json

# the files to merge
filelist = ["file1.json", "file2.json", "file3.json"]

final_result = []
for filename in filelist:
    with open(filename) as infile:
        newdata = json.load(infile)
        # splice this file's records onto the combined list
        final_result.extend(newdata["results"])

# write the combined list to a single output file
with open("result.json", "w") as outfile:
    json.dump(final_result, outfile, indent=4)
result.json:
[
    {
        "engagement": {
            "id": 1,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
    {
        "engagement": {
            "id": 2,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
    {
        "engagement": {
            "id": 3,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    }
]
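The crucial call here is list.extend, which splices each file's records into one flat list; list.append would instead nest each "results" list whole and reproduce the per-file separation from the question. A quick illustration:
merged = []
merged.extend([1, 2])   # [1, 2]
merged.extend([3, 4])   # [1, 2, 3, 4] -- one flat list

nested = []
nested.append([1, 2])   # [[1, 2]]
nested.append([3, 4])   # [[1, 2], [3, 4]] -- still separated per file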
For getting files from a directory, I have this function. It takes a file path and, optionally, a file extension, and returns a list of filenames which you can use with the code above. If you need to get files from multiple directories, you can just extend the list of filenames as shown below the function.
import os
from typing import Union


def get_files_from_path(path: str = ".", ext: Union[str, list, None] = None) -> list:
    """Find files in path and return them as a list.

    Gets all files in folders and subfolders.

    See the answer on the link below for a ridiculously
    complete answer to this:
    https://stackoverflow.com/a/41447012/9267296

    Args:
        path (str, optional): Which path to start on.
            Defaults to '.'.
        ext (str/list, optional): Optional file extension(s).
            Defaults to None.

    Returns:
        list: list of full file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            filepath = os.path.join(subdir, fname)
            if ext is None:
                result.append(filepath)
            elif isinstance(ext, str) and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif isinstance(ext, list):
                for item in ext:
                    if fname.lower().endswith(item.lower()):
                        result.append(filepath)
    return result
filelist = get_files_from_path("your/path/here/", ext=".json")
filelist.extend(get_files_from_path("another/path/here/", ext=".json"))
Upvotes: 1
Reputation: 30
After help from @Edo Aske I found a solution to the problem. The final code looks like this:
import json
import os

path = '/content/extracted_data/'
json_files = [jfile for jfile in os.listdir(path) if jfile.endswith('.json')]

final_result = []
for filename in json_files:
    with open(os.path.join(path, filename)) as infile:
        newdata = json.load(infile)
        # grab the list stored under the "results" key
        newdata = newdata["results"]
        final_result.extend(newdata)

with open("result.json", "w") as outfile:
    json.dump(final_result, outfile, indent=4)
The result is that all the records from the JSON files end up as separate dicts in one list, and from there we can easily put them in a DataFrame using pd.json_normalize.
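For reference, a minimal sketch of that last step, assuming the merged result.json from above and that pandas is installed:
import json

import pandas as pd

with open("result.json") as infile:
    records = json.load(infile)

# nested keys become dotted column names, e.g. "engagement.id"
df = pd.json_normalize(records)
print(df.columns.tolist())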
Thanks for all your help guys!
Upvotes: 0