Reputation: 1557
I have a 2-D NumPy array whose number of columns is 100 times the number of rows. For example, if the number of rows is 1,000, the number of columns is 100,000, and the values are all integers. My objective is to return 1,000 unique integers, one for each of the 1,000 row indices. Values within a column are not all unique (there can be duplicates), so I have to search through the values in each row and pick the first integer that hasn't already been picked for a previous row. I have this reproducible loop, which works fine for smaller num_rows around ~1,000, but when it comes to dealing with more than 10,000 rows it is painfully slow. Is there a more efficient way to do this?
import numpy as np

maxval = 5000
matrix = np.random.randint(maxval, size=(maxval, maxval * 100))
neighbours = maxval - 1
indices = []  # this list will contain the outputs after the loop completes
for e in matrix:
    i = 0
    while i < neighbours:
        if e[i] in indices:
            i += 1
        else:
            indices.append(e[i])
            break
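For a tiny illustration of the intended behaviour (made-up values, not the real data):

small = np.array([[3, 1, 4],
                  [3, 3, 2],
                  [3, 2, 5]])
# row 0: 3 has not been picked yet            -> pick 3
# row 1: 3 is taken, 3 again, then 2 is free  -> pick 2
# row 2: 3 and 2 are taken                    -> pick 5
# expected output: [3, 2, 5]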
Upvotes: 0
Views: 93
Reputation: 1858
Using dictionaries will be much faster, but I don't know if it will be fast enough:
from collections import OrderedDict

indx = OrderedDict()
for e in matrix:
    i = 0
    while i < neighbours:
        v = e[i]
        if indx.get(v) is None:
            indx[v] = True
            break
        i += 1
results = list(indx.keys())
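On Python 3.7+ a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict (this assumes matrix and neighbours as defined in the question):

indx = {}
for e in matrix:
    for v in e[:neighbours]:   # only the first `neighbours` values, as in the question's loop
        if v not in indx:      # average O(1) membership test
            indx[v] = True
            break
results = list(indx)           # keys, in insertion order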
Upvotes: 0
Reputation: 31250
Not a numpy way, but if row has 100,000 elements, then

import random
random.sample(set(row), 1000)

is a random sample of 1000 unique elements from it.
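As a rough sketch of applying this to one row of the matrix from the question (on recent Python versions random.sample requires a sequence, so the set is converted to a list first):

import random

row = matrix[0]                            # one row with maxval*100 elements
unique_vals = list(set(row.tolist()))      # distinct values in that row
picked = random.sample(unique_vals, 1000)  # 1000 distinct values chosen at random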
Upvotes: 2
Reputation: 42678
You can use a set instead of a list for the lookup:
import numpy as np

maxval = 50
matrix = np.random.randint(maxval, size=(maxval, maxval * 100))
neighbours = maxval - 1
indices = set()  # this set will contain the outputs after the loop completes
for e in matrix:
    i = 0
    while i < neighbours:
        if e[i] in indices:
            i += 1
        else:
            indices.add(e[i])
            break
Here is a live example.
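The speed-up comes from set membership tests being average O(1), while a list membership test scans the whole list; a quick way to see the difference on your own machine (a sketch, timings will vary):

import timeit

lst = list(range(10_000))
st = set(lst)

print(timeit.timeit("9_999 in lst", globals=globals(), number=1_000))  # linear scan on every test
print(timeit.timeit("9_999 in st", globals=globals(), number=1_000))   # hash lookup on every test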
Upvotes: 0