Check for repeated values in JSON object array

I have a large JSON file with this structure:

[
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c9a",
      "temp":36.33,
      "x":-0.484375,
      "y":-0.0078125,
      "z":-0.859375,
      "rssi":-70,
      "id":-26648,
      "date":"2021-06-02/09:24:06.238"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.5078125,
      "y":0.0234375,
      "z":-0.84375,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.028"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.4921875,
      "y":0.0078125,
      "z":-0.8671875,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.153"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.4765625,
      "y":0.0234375,
      "z":-0.8671875,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.278"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.265625,
      "y":-0.0390625,
      "z":-0.9921875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.058"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.21875,
      "y":0.015625,
      "z":-0.9296875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.183"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.203125,
      "y":0.046875,
      "z":-0.9609375,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.308"
   }
]

What I'm trying to do is sort this file first by serial and then by date, and remove any duplicate objects that have the same id (even if some of their other values, like sniffer_serial, differ).

This is what I have so far:

import json
from itertools import groupby

#json filepath
json_file_path = "./myfile.json"

#opening and loading the file content
with open(json_file_path, 'r') as j:
    contents = json.loads(j.read())

data = {} #dict that will contain my sorted data

#sorting data
for key, items in groupby(sorted(contents, key=lambda x: (x['serial'], x['date'])), key=lambda x: x['serial']):
    data[key] = list(items)

#saving it as new file
with open('datasorted.json', 'w') as f:
    f.write(str(data))

What I'm having trouble with is removing the duplicated objects that have the same id. Should I create another dict and iterate over it to see if it already has an entry with the same id?
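A rough (untested) sketch of what I mean:

seen_ids = set()  # ids already kept once
deduped = []
for item in sorted(contents, key=lambda x: (x['serial'], x['date'])):
    if item['id'] not in seen_ids:
        seen_ids.add(item['id'])
        deduped.append(item)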

This is how I expect the final JSON file to look:

[
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c94",
      "temp":35.08,
      "x":-0.5078125,
      "y":0.0234375,
      "z":-0.84375,
      "rssi":-87,
      "id":-26633,
      "date":"2021-06-02/09:24:06.028"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39c9a",
      "temp":36.33,
      "x":-0.484375,
      "y":-0.0078125,
      "z":-0.859375,
      "rssi":-70,
      "id":-26648,
      "date":"2021-06-02/09:24:06.238"
   },
   {
      "sniffer_serial":"7c9ebd939ab8",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.265625,
      "y":-0.0390625,
      "z":-0.9921875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.058"
   },
   {
      "sniffer_serial":"7c9ebd9448a0",
      "serial":"086bd7c39d3b",
      "temp":37.19,
      "x":-0.21875,
      "y":0.015625,
      "z":-0.9296875,
      "rssi":-86,
      "id":-30714,
      "date":"2021-06-02/09:24:06.183"
   }
]

EDIT:

Creating a Pandas dataframe and trying to drop duplicates is raising the following error:

KeyError: Index(['id'], dtype='object')

Code:

dataPandas = pd.DataFrame.from_dict(data,orient='index')

dataPandas.drop_duplicates(subset="id",keep="first")

Upvotes: 1

Views: 2181

Answers (3)

pho

Reputation: 25489

I see a few issues:

  1. You want a list but you add all the items you care about to a dict. Then you write this dict to your output json.
  2. In your for key, items loop, items is an iterator that contains all the items in that group. If you only care about one of the items (e.g. the first), just set that value like so: data[key] = list(items)[0]

Incorporating these changes, you'd get:

data = [] #list that will contain my sorted, de-duplicated data

#sorting data and keeping the first item of each id group
for key, items in groupby(sorted(contents, key=lambda x: (x['serial'], x['date'])), key=lambda x: x['id']):
    data.append(next(items))

next(items) gets only the next item of the iterator. On the other hand, list(items)[0] would convert the entire iterator to a list, and then take the first element.
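For example, with a toy iterable (made up purely to illustrate the difference):

from itertools import groupby

for key, items in groupby([1, 1, 2, 2, 2, 3]):
    first = next(items)  # consumes exactly one element of the group
    rest = list(items)   # collects whatever next() left behind
    print(key, first, rest)
# 1 1 [1]
# 2 2 [2, 2]
# 3 3 []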

This gives us the following data:

print(json.dumps(data, indent=4))

[
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c94",
        "temp": 35.08,
        "x": -0.5078125,
        "y": 0.0234375,
        "z": -0.84375,
        "rssi": -87,
        "id": -26633,
        "date": "2021-06-02/09:24:06.028"
    },
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c9a",
        "temp": 36.33,
        "x": -0.484375,
        "y": -0.0078125,
        "z": -0.859375,
        "rssi": -70,
        "id": -26648,
        "date": "2021-06-02/09:24:06.238"
    },
    {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39d3b",
        "temp": 37.19,
        "x": -0.265625,
        "y": -0.0390625,
        "z": -0.9921875,
        "rssi": -86,
        "id": -30714,
        "date": "2021-06-02/09:24:06.058"
    }
]

One potential problem with this: groupby only merges consecutive elements, so grouping by id after sorting on (serial, date) only removes all duplicates if equal ids end up adjacent in that order. You could always do the groupby first and then sort on the serial:

unique_contents = [next(v) for k, v in groupby(contents, key=lambda x: x['id'])]
data = sorted(unique_contents, key=lambda x: (x['serial'], x['date']))
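If equal ids aren't guaranteed to sit next to each other in the raw file either, a safer variant of the same idea (a sketch, not part of the code above) is to sort by id before grouping, then re-sort the survivors:

# sort by id so groupby sees each id as one contiguous run
unique_contents = [
    next(v)
    for k, v in groupby(sorted(contents, key=lambda x: x['id']),
                        key=lambda x: x['id'])
]
# then restore the (serial, date) ordering
data = sorted(unique_contents, key=lambda x: (x['serial'], x['date']))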

Or, in one step, pass the generator expression that drove the unique_contents list comprehension directly to sorted():

data = sorted(
    (next(v) for k, v in groupby(contents, key=lambda x: x['id'])), 
    key=lambda x: (x['serial'], x['date'])
)

Also note: you can read and write json directly from the file:

#opening and loading the file content
with open(input_file_path, 'r') as j:
    contents = json.load(j)

with open(output_file_path, 'w') as j:
    json.dump(data, j, indent=4) # indent=4 for pretty-printing

Upvotes: 1

Paul P

Reputation: 3917

Your approach looks solid.

If you don't care about which of the duplicate elements you are using down the line, you can just take the first one:

...
for key, items in groupby(
    sorted(
        contents,
        key=lambda x: (x['serial'], x['date'])
    ),
    key=lambda x: x['serial']
):
    # items is an iterator and if you only care about the first element,
    # you can call next() once on it (instead of converting it to a list),
    # so that it doesn't iterate all entries.
    data[key] = next(items)

# Save as new file with indentation
with open('datasorted.json', 'w') as f:
    json.dump(data, f, indent=4)

The output will look like this:

{
    "086bd7c39c94": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c94",
        "temp": 35.08,
        "x": -0.5078125,
        "y": 0.0234375,
        "z": -0.84375,
        "rssi": -87,
        "id": -26633,
        "date": "2021-06-02/09:24:06.028"
    },
    "086bd7c39c9a": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39c9a",
        "temp": 36.33,
        "x": -0.484375,
        "y": -0.0078125,
        "z": -0.859375,
        "rssi": -70,
        "id": -26648,
        "date": "2021-06-02/09:24:06.238"
    },
    "086bd7c39d3b": {
        "sniffer_serial": "7c9ebd939ab8",
        "serial": "086bd7c39d3b",
        "temp": 37.19,
        "x": -0.265625,
        "y": -0.0390625,
        "z": -0.9921875,
        "rssi": -86,
        "id": -30714,
        "date": "2021-06-02/09:24:06.058"
    }
}

Upvotes: 1

zglin

Reputation: 2919

Consider using pandas to create a DataFrame from your data with pd.DataFrame.from_dict and then running the de-dupe function (pandas.DataFrame.drop_duplicates).
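A minimal sketch of that approach, assuming contents is the original list of record dicts (building the frame from the dict keyed by serial is what drops the id column and triggers the KeyError from the edit):

import json
import pandas as pd

with open("./myfile.json") as j:
    contents = json.load(j)  # the original list of record dicts

df = pd.DataFrame(contents)

# sort by serial then date, then keep the first row seen for each id;
# drop_duplicates returns a new frame, so assign the result back
df = df.sort_values(["serial", "date"])
df = df.drop_duplicates(subset="id", keep="first")

with open("datasorted.json", "w") as f:
    json.dump(df.to_dict(orient="records"), f, indent=4)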

Upvotes: 0
