Reputation: 58791
I have a (large) list of lists of integers, e.g.,
a = [
[1, 2],
[3, 6],
[2, 1],
[3, 5],
[3, 6]
]
Most of the pairs will appear twice, where the order of the integers doesn't matter (i.e., [1, 2]
is equivalent to [2, 1]
). I'd now like to find the pairs that appear only once, and get a Boolean list indicating that. For the above example,
b = [False, False, False, True, False]
Since a
is typically large, I'd like to avoid explicit loops. Mapping to frozenset
s may be advised, but I'm not sure if that's overkill.
Upvotes: 14
Views: 16633
Reputation: 58791
Here's a solution with NumPy that 10 times faster than the suggested frozenset
solution:
a = numpy.array(a)
a.sort(axis=1)
b = numpy.ascontiguousarray(a).view(
numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))
)
_, inv, ct = numpy.unique(b, return_inverse=True, return_counts=True)
print(ct[inv] == 1)
Sorting is fast and makes sure that the edges [i, j]
, [j, i]
in the original array identify with each other. Much faster than frozenset
s or tuple
s.
Row uniquification inspired by https://stackoverflow.com/a/16973510/353337.
Speed comparison for different array sizes:
The plot was created with
from collections import Counter
import numpy
import perfplot
def fs(a):
ctr = Counter(frozenset(x) for x in a)
b = [ctr[frozenset(x)] == 1 for x in a]
return b
def with_numpy(a):
a = numpy.array(a)
a.sort(axis=1)
b = numpy.ascontiguousarray(a).view(
numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))
)
_, inv, ct = numpy.unique(b, return_inverse=True, return_counts=True)
res = ct[inv] == 1
return res
perfplot.save(
"out.png",
setup=lambda n: numpy.random.randint(0, 10, size=(n, 2)),
kernels=[fs, with_numpy],
labels=["frozenset", "numpy"],
n_range=[2 ** k for k in range(15)],
xlabel="len(a)",
)
Upvotes: 11
Reputation: 391
Use a dictionary for an O(n) solution.
a = [ [1, 2], [3, 6], [2, 1], [3, 5], [3, 6] ]
dict = {}
boolList = []
# Iterate through a
for i in range (len(a)):
# Assume that this element is not a duplicate
# This 'True' is added to the corresponding index i of boolList
boolList += [True]
# Set elem to the current pair in the list
elem = a[i]
# If elem is in ascending order, it will be entered into the map as is
if elem[0] <= elem[1]:
key = repr(elem)
# If not, change it into ascending order so keys can easily be compared
else:
key = repr( [ elem[1] ] + [ elem[0] ])
# If this pair has not yet been seen, add it as a key to the dictionary
# with the value a list containing its index in a.
if key not in dict:
dict[key] = [i]
# If this pair is a duploicate, add the new index to the dict. The value
# of the key will contain a list containing the indeces of that pair in a.
else:
# Change the value to contain the new index
dict[key] += [i]
# Change boolList for this to True for this index
boolList[i] = False
# If this is the first duplicate for the pair, make the first
# occurrence of the pair into a duplicate as well.
if len(dict[key]) <= 2:
boolList[ dict[key][0] ] = False
print a
print boolList
Upvotes: -1
Reputation: 1567
ctr = Counter(frozenset(x) for x in a)
b = [ctr[frozenset(x)] == 1 for x in a]
We can use Counter to get counts of each list (turn list to frozenset to ignore order) and then for each list check if it only appears once.
Upvotes: 14
Reputation: 137
#-*- coding : utf-8 -*-
a = [[1, 2], [3, 6], [2, 1], [3, 5], [3, 6]]
result = filter(lambda el:(a.count([el[0],el[1]]) + a.count([el[1],el[0]]) == 1),a)
bool_res = [ (a.count([el[0],el[1]]) + a.count([el[1],el[0]]) == 1) for el in a]
print result
print bool_res
wich gives :
[[3, 5]]
[False, False, False, True, False]
Upvotes: 2
Reputation: 8781
You could scan the list from start to end, while maintaining a map
of encountered pairs to their first position. Whenever you process a pair, you check to see if you've encountered it before. If that's the case, both the first encounter's index in b and the current encounter's index must be set to False. Otherwise, we just add the current index to the map of encountered pairs and change nothing about b. b will start initially all True
. To keep things equivalent wrt [1,2]
and [2,1]
, I'd first simply sort the pair, to obtain a stable representation. The code would look something like this:
def proc(a):
b = [True] * len(a) # Better way to allocate this
filter = {}
idx = 0
for p in a:
m = min(p)
M = max(p)
pp = (m, M)
if pp in filter:
# We've found the element once previously
# Need to mark both it and the current value as "False"
# If we encounter pp multiple times, we'll set the initial
# value to False multiple times, but that's not an issue
b[filter[pp]] = False
b[idx] = False
else:
# This is the first time we encounter pp, so we just add it
# to the filter for possible later encounters, but don't affect
# b at all.
filter[pp] = idx
idx++
return b
The time complexity is O(len(a))
which is good, but the space complexity is also O(len(a))
(for filter
), so this might not be so great. Depending on how flexible you are, you can use an approximate filter such as a Bloom filter.
Upvotes: 2