Reputation: 45
mat = [ [1,3,5,7], [1,2,5,7], [8,2,3,4] ]
I have to design a function that, taking each row in turn as the reference row, counts per column how many rows (the reference row included) hold the same value as the reference.
The expected result for each row is:
row0 = [2,1,2,2]
row1 = [2,2,2,2]
row2 = [1,2,1,1]
Every row of the matrix mat is a user and every column is a tag for the user's position in a defined unit of time. So for every time step (i.e. every column) I have to count how many users share the same position.
I tried to use the numpy count_nonzero function, but it requires a condition that I have not been able to apply across the whole reference row.
Upvotes: 3
Views: 288
Reputation: 10971
A simple solution is to 1) count how often each value occurs in each column, then 2) use those counts to build the result row by row.
from collections import Counter
mat = [[1,3,5,7], [1,2,5,7], [8,2,3,4]]
col_counts = [Counter(col) for col in zip(*mat)]
results = [[count[cell] for cell, count in zip(row, col_counts)] for row in mat]
The result is:
[[2, 1, 2, 2], [2, 2, 2, 2], [1, 2, 1, 1]]
Note that in the first row [1,3,5,7], element 3 corresponds to a 1, not a zero, as there is exactly one 3 in the second column [3, 2, 2].
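To make the intermediate step concrete, the sketch below inspects the per-column counters for the sample matrix. The names columns, second_column and second_counts are only introduced here for illustration.

```python
from collections import Counter

mat = [[1, 3, 5, 7], [1, 2, 5, 7], [8, 2, 3, 4]]

# zip(*mat) transposes the matrix, yielding one tuple per column.
columns = list(zip(*mat))
col_counts = [Counter(col) for col in columns]

# Look at the second column and the counts that were built from it.
second_column = columns[1]
second_counts = col_counts[1]
print(second_column, dict(second_counts))
```

Looking up a cell value in the counter of its own column is what produces each entry of the result.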
A slightly lighter solution keeps only one counter alive at a time. The transformation is also spelled out step by step so it is easier to follow:
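from collections import Counter

def row_count(mat):
    def row_transform(row):
        # Replace every element by its number of occurrences in this row.
        count = Counter(row)
        return [count[e] for e in row]
    matT = zip(*mat)                        # transpose: columns become rows
    matT_count = map(row_transform, matT)   # count within each column
    return zip(*matT_count)                 # transpose back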
If you need a list, call list(row_count(mat)). If you only need to iterate over the rows, use for row in row_count(mat): instead, which saves some more memory because the result rows are produced one at a time.
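As a usage sketch of the lazy variant, the loop below consumes row_count one row at a time. The function is repeated in compact form so the block runs on its own, and rows_seen is a name used only for this illustration.

```python
from collections import Counter

def row_count(mat):
    def row_transform(row):
        count = Counter(row)
        return [count[e] for e in row]
    # Transpose, count within each column, transpose back (lazily).
    return zip(*map(row_transform, zip(*mat)))

mat = [[1, 3, 5, 7], [1, 2, 5, 7], [8, 2, 3, 4]]

rows_seen = []
for row in row_count(mat):
    # Each row arrives as a tuple of counts.
    rows_seen.append(row)
print(rows_seen)
```

Note that zip(*...) has to consume every transformed column before it can yield the first row, so the saving applies to the output rows, not to the per-column count lists.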
Upvotes: 0
Reputation: 53119
Here is a numpy solution using argsort. It can also handle non-integer entries:
import numpy as np
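def count_per_col(a):
    # Sort every column so that equal values become adjacent.
    o = np.argsort(a, 0)
    ao = np.take_along_axis(a, o, 0)
    # Mark the start of each run of equal values, one padded row per column.
    padded = np.ones((ao.shape[1], ao.shape[0]+1), int)
    # Compare against 0 so that float differences below 1 are not truncated away.
    padded[:, 1:-1] = np.diff(ao, axis=0).T != 0
    i, j = np.where(padded)
    # Run lengths; negative jumps at column boundaries are clipped to 0.
    j = np.maximum(np.diff(j), 0)
    J = j.repeat(j)
    # Scatter the run lengths back to the original (unsorted) positions.
    out = np.empty(a.shape, int)
    np.put_along_axis(out, o, J.reshape(out.shape[::-1]).T, 0)
    return out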
mat = np.array([[1,3,5,7], [1,2,5,7], [8,2,3,4]])
count_per_col(mat)
# array([[2, 1, 2, 2],
# [2, 2, 2, 2],
# [1, 2, 1, 1]])
How fast?
from timeit import timeit
large = np.random.randint(0, 100, (100, 10000))
large = np.random.random(100)[large]
timeit(lambda: count_per_col(large), number=10)/10
# 0.1332556433044374
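For sanity-checking a vectorized version on small inputs, a brute-force reference can be written with broadcasting and np.count_nonzero, the function mentioned in the question. This is only a sketch: count_per_col_bruteforce is a name made up here, and the intermediate boolean array has shape (rows, rows, columns), so it is not meant for large matrices.

```python
import numpy as np

def count_per_col_bruteforce(a):
    a = np.asarray(a)
    # eq[i, k, j] is True when rows i and k hold the same value in column j.
    eq = a[:, None, :] == a[None, :, :]
    # Summing over k counts, for reference row i, the rows matching it in column j.
    return np.count_nonzero(eq, axis=1)

mat = np.array([[1, 3, 5, 7], [1, 2, 5, 7], [8, 2, 3, 4]])
print(count_per_col_bruteforce(mat))
```

Because it relies only on equality comparison, it works for float entries as well and can serve as a cross-check for the faster approaches.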
Upvotes: 1
Reputation: 36859
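A simple, vectorized solution is to shift each column into its own value range and count everything with a single np.bincount call (this requires non-negative integer entries):
import numpy as np
mat = np.array([
    [1,3,5,7],
    [1,2,5,7],
    [8,2,3,4]
])
# Offset column j by j * (max + 1) so values from different columns never collide.
tmp = mat + np.arange(mat.shape[1]) * (np.max(mat) + 1)
np.bincount(tmp.ravel())[tmp]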
# array([[2, 1, 2, 2],
# [2, 2, 2, 2],
# [1, 2, 1, 1]])
Timings for a 64x8640 matrix:
# 4 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
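If the entries are floats or negative numbers, np.bincount cannot be applied to them directly. One possible workaround, sketched here as an assumption rather than as part of the answer above, is to map the values to integer codes with np.unique first. The function name count_per_col_bincount is hypothetical.

```python
import numpy as np

def count_per_col_bincount(a):
    a = np.asarray(a)
    # Map arbitrary values to integer codes 0 .. n_unique - 1.
    uniq, codes = np.unique(a, return_inverse=True)
    # Reshape explicitly, since the shape of the inverse differs across NumPy versions.
    codes = codes.reshape(a.shape)
    # Shift each column into its own disjoint range of codes.
    keys = codes + np.arange(a.shape[1]) * uniq.size
    # Count every key once, then look the counts up per cell.
    return np.bincount(keys.ravel())[keys]

mat = np.array([[1, 3, 5, 7], [1, 2, 5, 7], [8, 2, 3, 4]])
print(count_per_col_bincount(mat))
```

The extra np.unique call sorts all entries, so this variant will likely be slower than the plain integer version timed above.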
Upvotes: 1