Reputation: 668
I found several ways of removing duplicated data. However, every one I've found keeps a single copy of each duplicated data point rather than removing the duplicates entirely. For my model, keeping that single copy causes erroneous behavior, so I was wondering whether there is a way to remove every row that has a duplicate. To be clearer, suppose the data is as below:
x = [[1, 2, 3, 4],
     [1, 2, 3, 4],
     [5, 2, 1, 4],
     [5, 2, 1, 4],
     [3, 4, 2, 4]]
Then I want only the last row, [3, 4, 2, 4], so that every row with a duplicate is removed entirely (I'm struggling to find the right expression for this). I've tried a 'for' loop (extracting the rows that were not unique, comparing each of them to the unique data set, and removing the matches as well), but my data has about 50k rows and this takes too much time. Is there an efficient way to do this in Python?
P.S. Just in case, I used the code below to find the unique set of data points:
temp = np.ascontiguousarray(raw_input).view(np.dtype((np.void, raw_input.dtype.itemsize*raw_input.shape[1])))
_, idx = np.unique(temp, return_index = True)
input_data = raw_input[idx] # unique input data
output_data = output_label[idx]
Upvotes: 1
Views: 94
Reputation: 1369
Check this out:
final_list = list(filter(lambda tup:x.count(list(tup))==1, list(set(map(tuple,x)))))
list(map(list,final_list))
Upvotes: 0
Reputation: 8378
Staying within "standard" Python:
from collections import Counter
c = Counter(map(tuple, x))
output_data = [list(k) for k, v in c.items() if v == 1]
If you want to know the indices (in x) of rows that were removed (because they had duplicates), you can do the following:
rem = [idx for idx, k in enumerate(x) if c[tuple(k)] > 1]
Alternatively (or preferably), using numpy:
import numpy as np
u, invidx, cnt = np.unique(x, axis=0, return_inverse=True, return_counts=True)
rem = np.flatnonzero(cnt[invidx] > 1)
output_data = u[cnt == 1]
In [1]: from collections import Counter
In [2]: x = [[1, 2, 3, 4],
...: [1, 2, 3, 4],
...: [5, 2, 1, 4],
...: [5, 2, 1, 4],
...: [3, 4, 2, 4]]
...:
In [3]: c = Counter(map(tuple, x))
In [4]: output_data = [list(k) for k, v in c.items() if v == 1]
In [5]: print(output_data)
[[3, 4, 2, 4]]
numpy:
In [30]: u, invidx, cnt = np.unique(x, axis=0, return_inverse=True,
...: return_counts=True)
In [31]: print(u)
[[1 2 3 4]
[3 4 2 4]
[5 2 1 4]]
In [32]: print(invidx)
[0 0 2 2 1]
In [33]: print(cnt)
[2 1 2]
In [34]: rem = np.flatnonzero(cnt[invidx] > 1)
In [35]: output_data = u[cnt == 1]
In [36]: print(rem)
[0 1 2 3]
In [37]: print(output_data)
[[3 4 2 4]]
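Note that u comes back sorted, so u[cnt == 1] does not preserve the original row order and cannot be used directly to pick the matching labels. A minimal sketch of one way around that, assuming the question's raw_input and output_label are NumPy arrays with one label per row (the values below are made-up stand-ins), is to build a boolean mask from the same counts:

```python
import numpy as np

# Hypothetical stand-ins for the question's raw_input and output_label arrays.
raw_input = np.array([[1, 2, 3, 4],
                      [1, 2, 3, 4],
                      [5, 2, 1, 4],
                      [5, 2, 1, 4],
                      [3, 4, 2, 4]])
output_label = np.array([10, 11, 12, 13, 14])

# Count how often each distinct row occurs and map every row to its distinct row.
_, invidx, cnt = np.unique(raw_input, axis=0,
                           return_inverse=True, return_counts=True)
# Flatten defensively: the shape of the inverse has differed between
# NumPy releases when axis is given.
invidx = np.ravel(invidx)

# True for rows whose content occurs exactly once in raw_input.
keep = cnt[invidx] == 1

input_data = raw_input[keep]      # rows in their original order
output_data = output_label[keep]  # labels stay aligned with the kept rows
```

Because keep has one entry per original row, the same mask can index any other array that is aligned with raw_input.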
Upvotes: 4
Reputation: 2133
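Does this work for you?
a = [[1, 2], [1, 2], [2, 3], [3, 4], [3, 4]]
b = a[:]  # work on a shallow copy so that a stays unchanged
# Walk backwards so that deleting from b does not shift the indices still to be visited.
# Note: this assumes duplicates are adjacent and come in pairs.
for i in range(len(a) - 1, 0, -1):
    if a[i] == a[i - 1]:
        del b[i - 1:i + 1]
# a == [[1, 2], [1, 2], [2, 3], [3, 4], [3, 4]]
# b == [[2, 3]]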
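The loop above compares neighbours, so it relies on equal rows sitting next to each other, and a run of three or more equal rows can also delete an unrelated neighbour. A rough sketch of a variant that avoids both issues, by sorting first and grouping equal rows with itertools.groupby (the helper name drop_duplicated_rows is made up for illustration, and the result comes back in sorted order rather than the original order):

```python
from itertools import groupby

def drop_duplicated_rows(rows):
    """Return the rows that occur exactly once, in sorted order."""
    result = []
    # Sorting puts equal rows next to each other, so groupby sees each
    # distinct row as a single group regardless of the input order.
    for key, group in groupby(sorted(rows)):
        if len(list(group)) == 1:
            result.append(key)
    return result

a = [[1, 2], [3, 4], [1, 2], [2, 3], [3, 4], [3, 4]]
b = drop_duplicated_rows(a)
# a is left unchanged; b holds only the rows without any duplicate
```

Sorting costs O(n log n), which should be fine for around 50k rows, though the Counter approach in the other answer avoids the sort altogether.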
Upvotes: 0