MapZombie

Reputation: 25

Find duplicates based on specific key/value only

I'm trying to tag objects that are duplicates in a JSON array using Python, based only on the key/values for "price" and "full address" and ignoring "url". A new "duplicate" key should then be added to each duplicate, with a value of 1 or 2 (1 for the first occurrence, 2 for the second). How can this best be done? Current:

A = [
    {
        "url": "google.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "yahoo.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "bing.com",
        "price": 250,
        "full address": "123 50th st",
    },
]

Intended result:

A = [
    {
        "url": "google.com",
        "price": 550,
        "full address": "123 sesame st",
        "duplicate": 1,
    },
    {
        "url": "yahoo.com",
        "price": 550,
        "full address": "123 sesame st",
        "duplicate": 2,
    },
    {
        "url": "bing.com",
        "price": 250,
        "full address": "123 50th st",
    },
]

Upvotes: 0

Views: 138

Answers (2)

surya

Reputation: 809

An optimized version of @iz_'s answer:

Instead of doing a second pass to delete the key from every non-duplicate, add the "duplicate" key only once a price/address pair has occurred more than once. This way, the list is iterated only once.

from collections import defaultdict

A = [
    {
        "url": "google.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "yahoo.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "bing.com",
        "price": 250,
        "full address": "123 50th st",
    },
]

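# Maps each (price, full address) pair to its running count
# and the index of its first occurrence in A.
counts = defaultdict(dict)
for index, d in enumerate(A):
    k = (d["price"], d["full address"])
    counts[k]["count"] = counts[k].get("count", 0) + 1
    if counts[k]["count"] == 1:
        # First time this pair is seen: remember its position, don't tag it yet.
        counts[k]["first_occurrence"] = index
    else:
        # A repeat: tag the first occurrence as 1 and this one with its running count.
        A[counts[k]["first_occurrence"]]["duplicate"] = 1
        d["duplicate"] = counts[k]["count"]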

print(A)

Output:

[{'full address': '123 sesame st', 'duplicate': 1, 'price': 550, 'url': 'google.com'}, {'full address': '123 sesame st', 'duplicate': 2, 'price': 550, 'url': 'yahoo.com'}, {'full address': '123 50th st', 'price': 250, 'url': 'bing.com'}]

Upvotes: 1

iz_

Reputation: 16613

Keep a running tally of duplicates and do a second pass to delete the key for any non-duplicate:

from collections import defaultdict

A = [
    {
        "url": "google.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "yahoo.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "bing.com",
        "price": 250,
        "full address": "123 50th st",
    },
]

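counts = defaultdict(int)

# First pass: number every object by how many times its key has been seen so far.
for d in A:
    k = (d["price"], d["full address"])
    counts[k] += 1
    d["duplicate"] = counts[k]

# Second pass: remove the tag from objects whose key occurred only once.
for d in A:
    if counts[(d["price"], d["full address"])] == 1:
        del d["duplicate"]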

print(A)
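
If the fields to compare might change, the same idea can be wrapped in a function. This is only a minimal sketch: the helper name `tag_duplicates`, its `keys` parameter, and the variable `listings` are made up for illustration and are not part of the question. It counts the totals up front with `collections.Counter`, so only real duplicates ever get the "duplicate" key and nothing has to be deleted afterwards. It still walks the list twice and modifies the dicts in place.

```python
from collections import Counter, defaultdict


def tag_duplicates(records, keys=("price", "full address")):
    """Hypothetical helper: number records that share the same values for `keys`."""
    # Total number of records per key tuple.
    totals = Counter(tuple(r[f] for f in keys) for r in records)
    # Running count per key tuple, used as the 1-based "duplicate" value.
    seen = defaultdict(int)
    for r in records:
        k = tuple(r[f] for f in keys)
        if totals[k] > 1:  # only tag keys that occur more than once
            seen[k] += 1
            r["duplicate"] = seen[k]
    return records


listings = [
    {"url": "google.com", "price": 550, "full address": "123 sesame st"},
    {"url": "yahoo.com", "price": 550, "full address": "123 sesame st"},
    {"url": "bing.com", "price": 250, "full address": "123 50th st"},
]

tag_duplicates(listings)
print(listings)
```

Passing a different tuple, for example `keys=("full address",)`, would compare on the address alone.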

Upvotes: 1
