Reputation: 1557
I have a 2-D NumPy array whose number of columns is 100 times the number of rows. For example, if the number of rows is 1,000, the number of columns is 100,000, and the values are all integers. My objective is to return 1,000 unique integers, one for each of the 1,000 row indices. Values within a column are not all unique (there can be duplicates), so I have to search through the values in each row and pick the first integer that hasn't already been picked for a previous row. I have this reproducible loop, which works fine for smaller num_rows around ~1,000, but when it comes to dealing with more than 10,000 rows it is painfully slow. Is there a more efficient way to do this?
import numpy as np

maxval = 5000
matrix = np.random.randint(maxval, size=(maxval, maxval * 100))
neighbours = maxval - 1
indices = []  # this list will contain the outputs after the loop completes
for e in matrix:
    i = 0
    while i < neighbours:
        if e[i] in indices:
            i += 1
        else:
            indices.append(e[i])
            break
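For a tiny illustration of the intended behaviour (made-up values, not the real data):

small = np.array([[3, 1, 4],
                  [3, 3, 2],
                  [3, 2, 5]])
# row 0: 3 has not been picked yet            -> pick 3
# row 1: 3 is taken, 3 again, then 2 is free  -> pick 2
# row 2: 3 and 2 are taken                    -> pick 5
# expected output: [3, 2, 5]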
Upvotes: 0
Views: 93
Reputation: 1858
Using dictionaries will be much faster, but I don't know if it will be fast enough:
from collections import OrderedDict

indx = OrderedDict()
for e in matrix:
    i = 0
    while i < neighbours:
        v = e[i]
        if indx.get(v) is None:
            indx[v] = True
            break
        i += 1
results = list(indx.keys())
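On Python 3.7+ a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict (this assumes matrix and neighbours as defined in the question):

indx = {}
for e in matrix:
    for v in e[:neighbours]:   # only the first `neighbours` values, as in the question's loop
        if v not in indx:      # average O(1) membership test
            indx[v] = True
            break
results = list(indx)           # keys, in insertion order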
Upvotes: 0
Reputation: 31250
Not a numpy way, but if row has 100,000 elements, then

import random
random.sample(set(row), 1000)

is a random sample of 1000 unique elements from it.
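As a rough sketch of applying this to one row of the matrix from the question (on recent Python versions random.sample requires a sequence, so the set is converted to a list first):

import random

row = matrix[0]                            # one row with maxval*100 elements
unique_vals = list(set(row.tolist()))      # distinct values in that row
picked = random.sample(unique_vals, 1000)  # 1000 distinct values chosen at random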
Upvotes: 2
Reputation: 42678
You can use a set instead of a list for the lookup:
import numpy as np

maxval = 50
matrix = np.random.randint(maxval, size=(maxval, maxval * 100))
neighbours = maxval - 1
indices = set()  # this set will contain the outputs after the loop completes
for e in matrix:
    i = 0
    while i < neighbours:
        if e[i] in indices:
            i += 1
        else:
            indices.add(e[i])
            break
Here is a live example.
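The speed-up comes from set membership tests being average O(1), while a list membership test scans the whole list; a quick way to see the difference on your own machine (a sketch, timings will vary):

import timeit

lst = list(range(10_000))
st = set(lst)

print(timeit.timeit("9_999 in lst", globals=globals(), number=1_000))  # linear scan on every test
print(timeit.timeit("9_999 in st", globals=globals(), number=1_000))   # hash lookup on every test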
Upvotes: 0