Reputation: 21
I'm fairly new to Python and I'm trying to figure out how to find all duplicates within a JSON file. So far I've created this Python script to open the JSON file and parse the report. I need to find all potential duplicate transactions and print one line per duplicate containing the date, amount, description, and transaction ID. Please let me know if I'm on the correct path; any suggestions or pointers would help.
import json

# Open and parse the formatted JSON file.
with open("42525022_formatted-1.json", "r") as file_handle:
    parsed = json.load(file_handle)

# The report is organised as report -> items -> accounts -> transactions.
transactions = parsed["report"]["items"][0]["accounts"][0]["transactions"]

# Group transactions by calendar date.
transactions_by_date = {}
for txn in transactions:
    date = txn["date"]
    if date not in transactions_by_date:
        transactions_by_date[date] = []
    transactions_by_date[date].append(
        {
            "amount": txn["amount"],
            "description": txn["original_description"],
            "transaction_id": txn["transaction_id"],
        }
    )

#Ignored
#print(txn["date"] + "\n" + str(txn["amount"]))
#print(transactions_by_date)

# Print the transactions for the first date only, then stop.
for date in transactions_by_date:
    print(transactions_by_date[date])
    break
#Objective
#Print all duplicates within a calendar date should have date, amount, description and transactionID
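One way to finish the objective from here (a sketch, assuming a "duplicate" means two transactions on the same date with the same amount and description) is to count `(amount, description)` pairs within each date and report every transaction whose pair occurs more than once:

```python
from collections import Counter

def find_duplicates(transactions_by_date):
    """Return (date, amount, description, transaction_id) tuples for every
    transaction whose (amount, description) appears more than once on the
    same calendar date."""
    dupes = []
    for date, txns in transactions_by_date.items():
        counts = Counter((t["amount"], t["description"]) for t in txns)
        for t in txns:
            if counts[(t["amount"], t["description"])] > 1:
                dupes.append((date, t["amount"], t["description"], t["transaction_id"]))
    return dupes

# Tiny hypothetical sample in the same shape as transactions_by_date above.
sample = {
    "2022-07-02": [
        {"amount": 0, "description": "GOOGLE *ADS598329", "transaction_id": "a1"},
        {"amount": 0, "description": "GOOGLE *ADS598329", "transaction_id": "a2"},
        {"amount": 12.5, "description": "COFFEE", "transaction_id": "a3"},
    ]
}
for line in find_duplicates(sample):
    print(*line)
```

With your real data you would call `find_duplicates(transactions_by_date)` instead of the sample dict.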
Example JSON File Contents
"account_id": "zbbbZEdzo4iZbed98AbzHeqr3VX0NztOBQgZe",
"amount": 0,
"date": "2022-07-02",
"iso_currency_code": "USD",
"original_description": "GOOGLE *ADS598329",
"pending": true,
"transaction_id": "1XXX9XbVRKHj8eN66",
"unofficial_currency_code": null
},
Upvotes: 0
Views: 979
Reputation: 626
Would just detecting a duplicate ID be sufficient, or is there a chance there are multiple transactions with the same ID, but differing values for the other attributes?
I know you asked about achieving this with a Python dictionary; however, an additional tool would help here.
I would suggest using a library like pandas.
Then you can think of your data as a spreadsheet.
import pandas as pd

# `transactions` is the list of transaction dicts from your question.
df = pd.DataFrame(transactions)
# Boolean Series marking rows that duplicate an earlier row.
duplicates = df.duplicated()
Check out the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
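To actually see the duplicate rows rather than the boolean mask, filter the frame with that mask; `keep=False` marks every member of a duplicate group, and `subset=` lets you ignore the always-unique `transaction_id` column. A sketch with hypothetical rows shaped like your JSON:

```python
import pandas as pd

# Hypothetical rows in the shape of the question's transactions.
df = pd.DataFrame([
    {"date": "2022-07-02", "amount": 0, "original_description": "GOOGLE *ADS598329", "transaction_id": "t1"},
    {"date": "2022-07-02", "amount": 0, "original_description": "GOOGLE *ADS598329", "transaction_id": "t2"},
    {"date": "2022-07-03", "amount": 5, "original_description": "COFFEE", "transaction_id": "t3"},
])

# Compare rows only on date, amount and description; keep=False flags
# every row in each duplicate group, not just the repeats.
dupes = df[df.duplicated(subset=["date", "amount", "original_description"], keep=False)]
print(dupes[["date", "amount", "original_description", "transaction_id"]])
```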
Upvotes: 1