Reputation: 95

Detect duplicates in a JSON list and delete them

I have a list of alerts that sometimes contains duplicates, once in German and once in English. I want to remove the duplicates from that list. I treat two alerts as duplicates when they have the same timestamps for "start" and "end". If an alert has a duplicate in the list, the whole duplicate entry ("description", "event", "start", ...) should be removed from the alerts list. In this case the second entry should be deleted:

{
"alerts": [
    {
        "description": "Es tritt leichter Frost auf.",
        "end": 1613379600,
        "event": "FROST",
        "lang": "de",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of frost",
        "end": 1613379600,
        "event": "frost",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of wind gusts",
        "end": 1613408400,
        "event": "wind gusts",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613336400
    }]}

How can I do this in Python and save the new alerts list without duplicates? I think it must be something like this (sorry for the pseudocode, I'm a beginner and couldn't adapt the existing examples). Please help, thanks a lot!

for item in data['alerts']:
    # pseudocode: "other" stands for any earlier alert in the list
    if item['start'] == other['start'] and item['end'] == other['end']:
        delete item

So that I get this output:

{
"alerts": [
    {
        "description": "Es tritt leichter Frost auf.",
        "end": 1613379600,
        "event": "FROST",
        "lang": "de",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of wind gusts",
        "end": 1613408400,
        "event": "wind gusts",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613336400
    }]}

Upvotes: 0

Answers (4)

Vishal Singh

Reputation: 6234

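You can group all the alerts with the same timestamps using itertools.groupby [Python-docs] and then select the English-language entry from each group. If a group has no English entry, the version below falls back to the first alert in that group, so alerts that exist in only one language are not lost.

from itertools import groupby

def time_key(alert):
    return (alert["end"], alert["start"])

def pick(group, lang="en"):  # change the preferred language accordingly
    # Prefer the alert in `lang`; fall back to the first alert of the group.
    return next((a for a in group if a["lang"] == lang), group[0])

# groupby only merges adjacent items, so sort by the same key first.
data["alerts"] = [
    pick(list(group))
    for _, group in groupby(sorted(data["alerts"], key=time_key), key=time_key)
]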
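
One detail worth spelling out: groupby only merges items that sit next to each other, which is why the list has to be sorted by the same key first. A minimal sketch with made-up (start, end) pairs, not the real alert data:

```python
from itertools import groupby

# Hypothetical (start, end) pairs; the first and last are duplicates.
pairs = [(1, 5), (2, 9), (1, 5)]

# Without sorting, the two (1, 5) items are not adjacent and land in separate groups.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(pairs)]

# After sorting, equal items are adjacent and groupby merges them.
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(pairs))]

print(unsorted_groups)  # [((1, 5), 1), ((2, 9), 1), ((1, 5), 1)]
print(sorted_groups)    # [((1, 5), 2), ((2, 9), 1)]
```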

Upvotes: 1

buran

Reputation: 14233

Sort the input list by lang in reverse order, so that en comes before de. Then build a dict whose key is the tuple (start, end) and take dict.values(). Because de comes after en, whenever several alerts share the same (start, end) key, the de alert overwrites the earlier value for that key.

data = {
"alerts": [
    {
        "description": "Es tritt leichter Frost auf.",
        "end": 1613379600,
        "event": "FROST",
        "lang": "de",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of frost",
        "end": 1613379600,
        "event": "frost",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of wind gusts",
        "end": 1613408400,
        "event": "wind gusts",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613336400
    }]}

unique = {(item['start'], item['end']):item for item in
           sorted(data['alerts'], key=lambda x: x['lang'], reverse=True)}
data['alerts'] = sorted(unique.values(), key=lambda x: (x['start'], x['end']))

Output:

{
    "alerts": [
        {
            "description": "Es tritt leichter Frost auf.",
            "end": 1613379600,
            "event": "FROST",
            "lang": "de",
            "sender_name": "DWD / Nationales Warnzentrum Offenbach",
            "start": 1613322000
        },
        {
            "description": "There is a risk of wind gusts",
            "end": 1613408400,
            "event": "wind gusts",
            "lang": "en",
            "sender_name": "DWD / Nationales Warnzentrum Offenbach",
            "start": 1613336400
        }
    ]
}

I'm not sure whether you need the result sorted by time. If not, you can drop the outer sorted() call and use list(unique.values()) instead.
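
If more than two languages can occur, sorting by the language code alone gets fragile. One possible variation, sketched under the assumption that you want an explicit preference order (the names LANG_RANK and dedupe_by_time are made up for this example, and the alerts are shortened):

```python
# Hypothetical preference order: lower rank wins; unknown languages rank last.
LANG_RANK = {"de": 0, "en": 1}

def dedupe_by_time(alerts, lang_rank=LANG_RANK):
    """Keep one alert per (start, end) pair, preferring the best-ranked language."""
    best = {}
    for alert in alerts:
        key = (alert["start"], alert["end"])
        rank = lang_rank.get(alert["lang"], len(lang_rank))
        # Replace the stored alert only if this one has a strictly better rank.
        if key not in best or rank < best[key][0]:
            best[key] = (rank, alert)
    return [alert for _, alert in best.values()]

alerts = [
    {"lang": "en", "event": "frost", "start": 1613322000, "end": 1613379600},
    {"lang": "de", "event": "FROST", "start": 1613322000, "end": 1613379600},
    {"lang": "en", "event": "wind gusts", "start": 1613336400, "end": 1613408400},
]
result = dedupe_by_time(alerts)
print([a["event"] for a in result])  # ['FROST', 'wind gusts']
```

Unlike the sort-based version, this keeps the alerts in the order in which each (start, end) pair first appears.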

Upvotes: 1

Icebreaker454

Reputation: 1071

You can do the filtering via dictionary comprehension:

data = {
"alerts": [
    {
        "description": "Es tritt leichter Frost auf.",
        "end": 1613379600,
        "event": "FROST",
        "lang": "de",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of frost",
        "end": 1613379600,
        "event": "frost",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of wind gusts",
        "end": 1613408400,
        "event": "wind gusts",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613336400
    }]}

filtered = {(entry["start"], entry["end"]): entry for entry in reversed(data["alerts"])}

data["alerts"] = list(filtered.values())

This approach relies on the fact that a duplicated dictionary key is overwritten by the last entry with that key. Remove the reversed() if you'd like to keep the last duplicated entry instead of the first one. Note that reversed() also reverses the order of the alerts in the result.
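
The overwrite behaviour, and an order-preserving alternative, can be seen on a small scale. This is only a sketch with shortened, made-up entries; it relies on dicts keeping insertion order (Python 3.7+) and uses dict.setdefault, which stores a value only if the key is not present yet:

```python
# Shortened, hypothetical entries: (start, end) is the duplicate key.
entries = [
    {"lang": "de", "start": 1, "end": 5},
    {"lang": "en", "start": 1, "end": 5},
    {"lang": "en", "start": 2, "end": 9},
]

# A later entry with the same key overwrites the earlier one.
last_wins = {(e["start"], e["end"]): e for e in entries}

# setdefault only stores the first entry seen for each key,
# so the first duplicate is kept and the input order is preserved.
first_wins = {}
for e in entries:
    first_wins.setdefault((e["start"], e["end"]), e)

print([e["lang"] for e in last_wins.values()])   # ['en', 'en']
print([e["lang"] for e in first_wins.values()])  # ['de', 'en']
```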

Upvotes: 2

Epsi95

Reputation: 9047

Keep track of the timestamps you have already seen. For each following element, check whether its timestamps were already visited and, if so, skip it.

data = {
"alerts": [
    {
        "description": "Es tritt leichter Frost auf.",
        "end": 1613379600,
        "event": "FROST",
        "lang": "de",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of frost",
        "end": 1613379600,
        "event": "frost",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613322000
    },
    {
        "description": "There is a risk of wind gusts",
        "end": 1613408400,
        "event": "wind gusts",
        "lang": "en",
        "sender_name": "DWD / Nationales Warnzentrum Offenbach",
        "start": 1613336400
    }]}

visited_timestamp = []
output = []
for each_message in data['alerts']:
    key = (each_message['end'], each_message['start'])
    if key not in visited_timestamp:  # first alert with this (end, start) pair
        output.append(each_message)
        visited_timestamp.append(key)

data['alerts'] = output
print(data)

{
'alerts':
 [
{'description': 'Es tritt leichter Frost auf.', 'end': 1613379600, 'event': 'FROST', 'lang': 'de', 'sender_name': 'DWD / Nationales Warnzentrum Offenbach', 'start': 1613322000}, 

{'description': 'There is a risk of wind gusts', 'end': 1613408400, 'event': 'wind gusts', 'lang': 'en', 'sender_name': 'DWD / Nationales Warnzentrum Offenbach', 'start': 1613336400}
]
}
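
The same idea also works with a set, which makes the membership test constant-time on average, and the result can be written back out with the json module, since the question also asks about saving the list. A rough sketch; the function name remove_duplicate_alerts and the shortened sample string are assumptions made for this example:

```python
import json

def remove_duplicate_alerts(data):
    """Return a copy of data whose 'alerts' keep only the first alert per (start, end)."""
    seen = set()
    unique = []
    for alert in data["alerts"]:
        key = (alert["start"], alert["end"])
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return {**data, "alerts": unique}

# Shortened sample input as a JSON string (hypothetical).
raw = '''{"alerts": [
    {"lang": "de", "start": 1613322000, "end": 1613379600},
    {"lang": "en", "start": 1613322000, "end": 1613379600},
    {"lang": "en", "start": 1613336400, "end": 1613408400}
]}'''

cleaned = remove_duplicate_alerts(json.loads(raw))

# json.dumps returns a string; to write a file, use json.dump(cleaned, file_object).
saved = json.dumps(cleaned, indent=4, ensure_ascii=False)
print(saved)
```

ensure_ascii=False keeps German umlauts readable in the output instead of escaping them.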

Upvotes: 0
