Nico Schlömer
Nico Schlömer

Reputation: 58791

Find unique pairs in list of pairs

I have a (large) list of lists of integers, e.g.,

a = [
    [1, 2],
    [3, 6],
    [2, 1],
    [3, 5],
    [3, 6]
    ]

Most of the pairs will appear twice, where the order of the integers doesn't matter (i.e., [1, 2] is equivalent to [2, 1]). I'd now like to find the pairs that appear only once, and get a Boolean list indicating that. For the above example,

b = [False, False, False, True, False]

Since a is typically large, I'd like to avoid explicit loops. Mapping to frozensets may be advised, but I'm not sure if that's overkill.

Upvotes: 14

Views: 16633

Answers (5)

Nico Schlömer
Nico Schlömer

Reputation: 58791

Here's a solution with NumPy that 10 times faster than the suggested frozenset solution:

a = numpy.array(a)
a.sort(axis=1)
b = numpy.ascontiguousarray(a).view(
    numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))
)
_, inv, ct = numpy.unique(b, return_inverse=True, return_counts=True)
print(ct[inv] == 1)
  • Sorting is fast and makes sure that the edges [i, j], [j, i] in the original array identify with each other. Much faster than frozensets or tuples.

  • Row uniquification inspired by https://stackoverflow.com/a/16973510/353337.

Speed comparison for different array sizes:

enter image description here

The plot was created with

from collections import Counter
import numpy
import perfplot


def fs(a):
    ctr = Counter(frozenset(x) for x in a)
    b = [ctr[frozenset(x)] == 1 for x in a]
    return b


def with_numpy(a):
    a = numpy.array(a)
    a.sort(axis=1)
    b = numpy.ascontiguousarray(a).view(
        numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))
    )
    _, inv, ct = numpy.unique(b, return_inverse=True, return_counts=True)
    res = ct[inv] == 1
    return res


perfplot.save(
    "out.png",
    setup=lambda n: numpy.random.randint(0, 10, size=(n, 2)),
    kernels=[fs, with_numpy],
    labels=["frozenset", "numpy"],
    n_range=[2 ** k for k in range(15)],
    xlabel="len(a)",
)

Upvotes: 11

Yaelle
Yaelle

Reputation: 391

Use a dictionary for an O(n) solution.

a = [ [1, 2], [3, 6], [2, 1], [3, 5], [3, 6] ]

dict = {}
boolList = []

# Iterate through a
for i in range (len(a)):

    # Assume that this element is not a duplicate
    # This 'True' is added to the corresponding index i of boolList
    boolList += [True]

    # Set elem to the current pair in the list
    elem = a[i]

    # If elem is in ascending order, it will be entered into the map as is
    if elem[0] <= elem[1]:
        key = repr(elem)
    # If not, change it into ascending order so keys can easily be compared
    else:
        key = repr( [ elem[1] ] + [ elem[0] ])

    # If this pair has not yet been seen, add it as a key to the dictionary
    # with the value a list containing its index in a.
    if key not in dict:
        dict[key] = [i]
    # If this pair is a duploicate, add the new index to the dict. The value
    # of the key will contain a list containing the indeces of that pair in a.
    else:
        # Change the value to contain the new index
        dict[key] += [i]

        # Change boolList for this to True for this index
        boolList[i] = False

        # If this is the first duplicate for the pair, make the first
        # occurrence of the pair into a duplicate as well.
        if len(dict[key]) <= 2:
            boolList[ dict[key][0] ] = False

print a
print boolList

Upvotes: -1

petabyte
petabyte

Reputation: 1567

ctr = Counter(frozenset(x) for x in a)
b = [ctr[frozenset(x)] == 1 for x in a]

We can use Counter to get counts of each list (turn list to frozenset to ignore order) and then for each list check if it only appears once.

Upvotes: 14

labzus
labzus

Reputation: 137

#-*- coding : utf-8 -*-
a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]
result = filter(lambda el:(a.count([el[0],el[1]]) + a.count([el[1],el[0]]) == 1),a)
bool_res = [ (a.count([el[0],el[1]]) + a.count([el[1],el[0]]) == 1) for el in a]
print result
print bool_res

wich gives :

[[3, 5]]
[False, False, False, True, False]

Upvotes: 2

Horia Coman
Horia Coman

Reputation: 8781

You could scan the list from start to end, while maintaining a map of encountered pairs to their first position. Whenever you process a pair, you check to see if you've encountered it before. If that's the case, both the first encounter's index in b and the current encounter's index must be set to False. Otherwise, we just add the current index to the map of encountered pairs and change nothing about b. b will start initially all True. To keep things equivalent wrt [1,2] and [2,1], I'd first simply sort the pair, to obtain a stable representation. The code would look something like this:

def proc(a):
  b = [True] * len(a) # Better way to allocate this
  filter = {}
  idx = 0
  for p in a:
    m = min(p)
    M = max(p)
    pp = (m, M)
    if pp in filter:
      # We've found the element once previously
      # Need to mark both it and the current value as "False"
      # If we encounter pp multiple times, we'll set the initial
      # value to False multiple times, but that's not an issue
      b[filter[pp]] = False
      b[idx] = False
    else:
      # This is the first time we encounter pp, so we just add it
      # to the filter for possible later encounters, but don't affect
      # b at all.
      filter[pp] = idx
    idx++
  return b

The time complexity is O(len(a)) which is good, but the space complexity is also O(len(a)) (for filter), so this might not be so great. Depending on how flexible you are, you can use an approximate filter such as a Bloom filter.

Upvotes: 2

Related Questions