Reputation: 668

Removing Duplicated data entirely without maintaining one

I have found several ways of removing duplicated data. However, none of them (at least none that I've found) removes the duplicates entirely; they all keep a single copy of each duplicated data point. For my model this leads to erroneous behavior, so I am wondering whether there is a way to remove every row that has a duplicate, including its first occurrence. To be more clear, suppose the data is as below:

x = [[1, 2, 3, 4],
     [1, 2, 3, 4],
     [5, 2, 1, 4],
     [5, 2, 1, 4],
     [3, 4, 2, 4]]

Then I want nothing but the last row [3, 4, 2, 4], i.e. every row that has a duplicate is removed entirely (I'm struggling to find the right expression). I've tried a 'for' loop (extracting the rows that were not unique, comparing each of them to the unique data set, and removing the matches as well), but my data has about 50k rows and this takes too much time. Is there an efficient way to do this in Python?

P.S. Just in case, I used the code below to find the unique set of data points:

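# View each row as one opaque (void) element so np.unique compares whole rows
temp = np.ascontiguousarray(raw_input).view(np.dtype((np.void, raw_input.dtype.itemsize * raw_input.shape[1])))
# idx holds the index of the first occurrence of each unique row
_, idx = np.unique(temp, return_index=True)
input_data = raw_input[idx]  # unique input data
output_data = output_label[idx]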

Upvotes: 1

Answers (3)

letmecheck

Reputation: 1369

Check this out (note that x.count scans the whole list once per unique row, so this gets slow for large inputs):

final_list = list(filter(lambda tup: x.count(list(tup)) == 1, list(set(map(tuple, x)))))
list(map(list, final_list))

Upvotes: 0

AGN Gazer

Reputation: 8378

Staying within "standard" Python,

from collections import Counter
c = Counter(map(tuple, x))
output_data = [list(k) for k, v in c.items() if v == 1]

If you want to know the indices (in x) of rows that were removed (because they had duplicates), you can do the following:

rem = [idx for idx, k in enumerate(x) if c[tuple(k)] > 1]

Alternatively (or preferably) using numpy:

import numpy as np

u, invidx, cnt = np.unique(x, axis=0, return_inverse=True, return_counts=True)
rem = np.flatnonzero(cnt[invidx] > 1)  # indices (in x) of rows that have duplicates
output_data = u[cnt == 1]              # rows that occur exactly once

Example:

In [1]: from collections import Counter

In [2]: x = [[1, 2, 3, 4],
   ...:      [1, 2, 3, 4],
   ...:      [5, 2, 1, 4],
   ...:      [5, 2, 1, 4],
   ...:      [3, 4, 2, 4]]
   ...:      

In [3]: c = Counter(map(tuple, x))

In [4]: output_data = [list(k) for k, v in c.items() if v == 1]

In [5]: print(output_data)
[[3, 4, 2, 4]]

Example using `numpy`:

In [30]: u, invidx, cnt = np.unique(x, axis=0, return_inverse=True,
    ...: return_counts=True)

In [31]: print(u)
[[1 2 3 4]
 [3 4 2 4]
 [5 2 1 4]]

In [32]: print(invidx)
[0 0 2 2 1]

In [33]: print(cnt)
[2 1 2]

In [34]: rem = np.flatnonzero(cnt[invidx] > 1)

In [35]: output_data = u[cnt == 1]

In [36]: print(rem)
[0 1 2 3]

In [37]: print(output_data)
[[3 4 2 4]]
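
If the original row order matters, or a parallel label array has to be filtered the same way (as with output_label in the question), the same counts can be turned into a boolean mask over the original rows. This is only a minimal sketch: it assumes raw_input is a 2-D array and output_label has one entry per row, and the label values below are made up for illustration.

```python
import numpy as np

raw_input = np.array([[1, 2, 3, 4],
                      [1, 2, 3, 4],
                      [5, 2, 1, 4],
                      [5, 2, 1, 4],
                      [3, 4, 2, 4]])
output_label = np.array([10, 11, 12, 13, 14])  # hypothetical labels, one per row

_, invidx, cnt = np.unique(raw_input, axis=0,
                           return_inverse=True, return_counts=True)

# cnt[invidx] is the number of occurrences of each original row;
# ravel() guards against NumPy versions that return a non-1-D inverse.
keep = cnt[invidx.ravel()] == 1

input_data = raw_input[keep]      # rows that occur exactly once, original order
output_data = output_label[keep]  # labels of those rows
```

Because the mask indexes the original arrays, the surviving rows keep their original order instead of the sorted order that np.unique returns.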

Upvotes: 4

AcK

Reputation: 2133

Does this work for you? It assumes that duplicates sit next to each other and come in pairs:

a=[[1,2],[1,2],[2,3],[3,4],[3,4]]
b=a[:]
for i in range(len(a)-1,0,-1):
    if a[i] == a[i-1]:
        del b[i-1:i+1]

# a == [[1, 2], [1, 2], [2, 3], [3, 4], [3, 4]]
# b == [[2, 3]]
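
For runs of three or more equal rows, or for input where equal rows are not adjacent, the loop above can give wrong results. One possible way around that is to sort first and then group equal neighbours with itertools.groupby. This is a sketch rather than a tuned solution, and the function name drop_all_duplicates is made up here:

```python
from itertools import groupby

def drop_all_duplicates(rows):
    """Return the rows that occur exactly once, in sorted order."""
    result = []
    # sorted() puts equal rows next to each other; groupby then yields
    # one group per distinct row.
    for _, group in groupby(sorted(rows)):
        group = list(group)
        if len(group) == 1:
            result.append(group[0])
    return result

a = [[3, 4], [1, 2], [2, 3], [3, 4], [1, 2], [3, 4]]
b = drop_all_duplicates(a)
# b == [[2, 3]]
```

Note that the result comes back in sorted order, not in the order of the input, and the sort makes this O(n log n).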

Upvotes: 0
