onyx
onyx

Reputation: 41

Compare multiple columns in numpy array

I have a 2D numpy array with about 12 columns and 1000+ rows and each cell contains a number from 1 to 5. I'm searching for the best sextuple of columns according to my point system where 1 and 2 generate -1 point and 4 and 5 gives +1.

If a row in a certain sextuple contains, for example, [1, 4, 5, 3, 4, 3] the point for this row should be +2, because 3*1 + 1*(-1) = 2. Next row may be [1, 2, 2, 3, 3, 3] and should be -3 points.

At first, I tried a strait forward loop solution but I realized there are 665 280 possible combinations of columns to compare and when I also need to search for the best quintuple, quadruple etc. the loop is taking forever.

Is there perhaps a smarter numpy-way of solving my problem?

Upvotes: 4

Views: 4934

Answers (3)

chthonicdaemon
chthonicdaemon

Reputation: 19760

import numpy

A = numpy.random.randint(1, 6, size=(1000, 12))
points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
columnsums = numpy.sum(points, 0)

def best6(row):
    return numpy.argsort(row)[-6:]

bestcolumns = best6(columnsums)
allbestcolumns = map(best6, points)

bestcolumns will now contain the best 6 columns in ascending order. By similar logic, allbestcolumns will contain the best six columns in each row.

Upvotes: 1

abought
abought

Reputation: 2680

Extending on unutbu's longer answer above, it's possible to generate the masked array of scores automatically. Since your scores for values are consistent every pass through the loop, so the scores for each value only need to be calculated once. Here's slightly inelegant way to do it on an example 6x10 array, before and after your scores are applied.

>>> import numpy
>>> values = numpy.random.randint(6, size=(6,10))
>>> values
array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
       [2, 5, 2, 2, 3, 1, 3, 5, 3, 1],
       [3, 3, 5, 4, 2, 1, 4, 0, 0, 1],
       [2, 4, 0, 0, 4, 1, 4, 0, 1, 0],
       [0, 4, 1, 2, 0, 3, 3, 5, 0, 1],
       [2, 3, 3, 4, 0, 1, 1, 1, 3, 2]])
>>> b = values.copy()
>>> b[ b<3 ] = -1

>>> b[ b==3 ] = 0
>>> b[ b>3 ] = 1
>>> b
array([[ 1,  1, -1, -1, -1,  1, -1, -1, -1,  1],
       [-1,  1, -1, -1,  0, -1,  0,  1,  0, -1],
       [ 0,  0,  1,  1, -1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1,  1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1, -1,  0,  0,  1, -1, -1],
       [-1,  0,  0,  1, -1, -1, -1, -1,  0, -1]])

Incidentally, this thread claims that creating the combinations directly within numpy will yield around 5x faster performance than itertools, though perhaps at the expense of some readability.

Upvotes: 0

unutbu
unutbu

Reputation: 879291

import numpy as np
import itertools

N_rows = 10
arr = np.random.random_integers(5, size=(N_rows,12))
x = np.array([0,-1,-1,0,1,1])
y = x[arr]

print(y)

score, best_sextuple = max((y[:,cols].sum(), cols)
                           for cols in itertools.combinations(range(12),6))
print('''\
score: {s}
sextuple: {c}
'''.format(s = score, c = best_sextuple))

yields, for example,

score: 6
sextuple: (0, 1, 5, 8, 10, 11)

Explanation:

First, let's generate a random example, with 12 columns and 10 rows:

N_rows = 10
arr = np.random.random_integers(5, size=(N_rows,12))

Now we can use numpy indexing to convert the numbers in arr 1,2,...,5 to the values -1,0,1 (according to your scoring system):

x = np.array([0,-1,-1,0,1,1])
y = x[arr]

Next, let's use itertools.combinations to generate all possible combinations of 6 columns:

for cols in itertools.combinations(range(12),6)

and

y[:,cols].sum()

then gives the score for cols, a choice of columns (sextuple).

Finally, use max to pick off the sextuple with the best score:

score, best_sextuple = max((y[:,cols].sum(), cols)
                           for cols in itertools.combinations(range(12),6))

Upvotes: 1

Related Questions