Reputation: 466
I have an API that creates a JSON file, like below:
"tesla_2.0": {
"kind": "Auto",
"tar_path": "/home/scripts/project_2/tesla_2.0.zip",
"version": "2.0",
"yaml_path": "/home/scripts/project_2/test.yaml",
"name": "tesla"
}
Since I'm reading it from a file, I use json.load() that will lose the order of the saved object unless I tell it to load into an OrderedDict().
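For illustration, this is the loading pattern described above — a minimal sketch using `object_pairs_hook=OrderedDict` (the `StringIO` object stands in for a real file handle):

```python
import json
from collections import OrderedDict
from io import StringIO

# StringIO stands in for an open file; json.load accepts any file-like object
raw = '{"b": 1, "a": 2}'
data = json.load(StringIO(raw), object_pairs_hook=OrderedDict)
print(list(data.keys()))  # keys come back in the order written: ['b', 'a']
```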
Is there a simple and efficient way to compare the two files?
import json
import os

def compare_json_files(file_1, file_2):
    if not os.path.isfile(file_1):
        raise FileNotFoundError("File not found: {}".format(file_1))
    if not os.path.isfile(file_2):
        raise FileNotFoundError("File not found: {}".format(file_2))
    with open(file_1, 'r') as f1:
        data_1 = json.load(f1)  # json.load for file objects, not json.loads
    with open(file_2, 'r') as f2:
        data_2 = json.load(f2)
    # comparison operation
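One point worth noting for the comparison step: plain `dict` equality in Python compares keys and values regardless of insertion order, so for an equality check the `OrderedDict` is not strictly needed. A minimal sketch (the helper name `compare_json_strings` is hypothetical, not from the question):

```python
import json

def compare_json_strings(s1, s2):
    # dict equality ignores key order, so differently ordered
    # but structurally identical JSON documents compare equal
    return json.loads(s1) == json.loads(s2)

a = '{"name": "tesla", "version": "2.0"}'
b = '{"version": "2.0", "name": "tesla"}'
print(compare_json_strings(a, b))  # True
```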
Python version : 3.5.2
Upvotes: 0
Views: 888
Reputation: 841
I believe you can check every key and value. First check that the sets of keys are equal on both sides; then a key-by-key comparison makes sense.
assert data_1.keys() == data_2.keys()
err_log = ['Err log:']
for k, v in data_1.items():
    try:
        assert v == data_2[k]
    except AssertionError:
        err_log.append('Error caught for key={}, data_1 value={}, data_2 value={}'.format(k, v, data_2[k]))
for e in err_log:
    print(e)
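The same idea can be run end-to-end with a dict comprehension that collects the mismatches (the sample dicts here are made up for illustration, not from the question):

```python
# hypothetical sample data with one differing value
data_a = {"kind": "Auto", "version": "2.0"}
data_b = {"kind": "Auto", "version": "2.1"}

assert data_a.keys() == data_b.keys()
# collect (old, new) pairs for every key whose values differ
diffs = {k: (v, data_b[k]) for k, v in data_a.items() if v != data_b[k]}
print(diffs)  # {'version': ('2.0', '2.1')}
```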
Results:
from copy import deepcopy
from time import time
from operator import itemgetter

n = 10000000
v = {"stuff": "here", "and": "there"}
data_1 = {str(k): deepcopy(v) for k in range(0, n)}
data_2 = {str(k): deepcopy(v) for k in range(n-1, -1, -1)}

def get_time(f):
    def _(*args, **kwargs):
        t_0 = time()
        for x in range(10):
            f(*args, **kwargs)
        return time() - t_0
    return _

def with_dict_keys(d):
    return d.keys()

def with_sorted_dict_keys(d):
    return sorted(d.keys())

@get_time
def order_n_compare(key_func, d, d_):
    k_d, k_d_ = key_func(d), key_func(d_)
    assert k_d == k_d_
    for k in k_d:
        assert d[k] == d_[k]

@get_time
def itemgetter_compare(key_func, d, d_):
    k_d, k_d_ = key_func(d), key_func(d_)
    assert k_d == k_d_
    assert itemgetter(*k_d)(d) == itemgetter(*k_d)(d_)
The cost of the dict.keys() call is negligible compared to iterating through all entries of data_1.items(), which grows as O(n), so it's not really worth optimizing. In CPython, dict.keys() returns a view object in constant time, so obtaining the keys is not the bottleneck either.

Upvotes: 1