Reputation: 11
Does anyone have experience creating a sparse matrix whose non-zero values follow a uniform distribution on [-0.5, 0.5] with zero mean (zero-centered) in Python (e.g. using scipy.sparse)?
I am aware that the scipy.sparse package provides a few functions for creating random sparse matrices, such as 'rand' and 'random'. However, I could not achieve what I want with them. For example, I tried:
import numpy as np
import scipy.sparse as sp
s = np.random.uniform(-0.5,0.5)
W=sp.random(1024, 1024, density=0.01, format='csc', data_rvs=s)
To clarify my idea: if I wanted a dense (non-sparse) version of this matrix, I would create it with:
dense=np.random.rand(1024,1024)-0.5
'np.random.rand(1024,1024)' creates a dense matrix with values uniform in [0,1). To make it zero-mean, I centre the matrix by subtracting 0.5.
However, if I create a sparse matrix, say:
sparse=sp.rand(1024,1024,density=0.01, format='csc')
The matrix will have non-zero values uniform in [0,1). However, if I want to centre the matrix, I cannot simply do 'sparse-=0.5', because that would make all the originally zero entries non-zero after the subtraction.
So, how can I achieve for a sparse matrix the same result as in the dense example above?
Thank you for all of your help!
Upvotes: 1
Views: 779
Reputation: 7304
The data_rvs parameter expects a "callable" that takes a size. This isn't exactly obvious from the documentation. It can be done with a lambda as follows:
import numpy as np
import scipy.sparse as sp
W = sp.random(1024, 1024, density=0.01, format='csc',
data_rvs=lambda s: np.random.uniform(-0.5, 0.5, size=s))
Then print(W) gives:
(243, 0) -0.171300809713
(315, 0) 0.0739590145626
(400, 0) 0.188151369316
(440, 0) -0.187384896218
: :
(1016, 0) 0.29262088084
(156, 1) -0.149881296136
(166, 1) -0.490405135834
(191, 1) 0.188167190147
(212, 1) 0.0334533020488
: :
(411, 1) 0.122330200832
(431, 1) -0.0494334160833
(813, 1) -0.0076379249885
(828, 1) 0.462807265425
: :
(840, 1021) 0.456423017883
(12, 1022) -0.47313075329
: :
(563, 1022) -0.477190349161
(655, 1022) -0.460942546313
(673, 1022) 0.0930207181126
(676, 1022) 0.253643616387
: :
(843, 1023) 0.463793903168
(860, 1023) 0.454427252782
For the newbie, the lambda may look odd; it is just an unnamed function. The sp.random function takes an optional argument data_rvs that defaults to None. When specified, it is expected to be a function that takes a size argument and returns that number of random numbers. A simple function to do this would be:
def generate_n_uniform_randoms(n):
    # Return n samples drawn uniformly from [-0.5, 0.5)
    return np.random.uniform(-0.5, 0.5, n)
I don't know the origin of the API, but the shape is not needed, as sp.random presumably first figures out which indices will be non-zero and then just needs to compute random values for those indices, which is a set of known size.
The lambda is just syntactic sugar that allows us to define that function inline in terms of some other function call. We could instead write:
W = sp.random(1024, 1024, density=0.01, format='csc',
data_rvs=generate_n_uniform_randoms)
Actually, this can be any "callable": some object f for which f(n) returns n random variables. This can be a function, but it can also be an object of a class that implements the __call__(self, n) method. For example:
class ufoo(object):
    def __call__(self, n):
        import numpy
        return numpy.random.uniform(-0.5, 0.5, n)
W = sp.random(1024, 1024, density=0.01, format='csc',
data_rvs=ufoo())
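As a further variation on the same idea, the callable can be backed by a seeded NumPy Generator so that the non-zero values are reproducible. This is only a minimal sketch: the helper name make_uniform_rvs is hypothetical, and it assumes a NumPy version that provides np.random.default_rng (1.17 or later). Only the values are seeded here; the positions of the non-zeros are still chosen by sp.random itself.

```python
import numpy as np
import scipy.sparse as sp

def make_uniform_rvs(seed, low=-0.5, high=0.5):
    """Hypothetical helper: return a callable f(n) giving n uniform values in [low, high)."""
    rng = np.random.default_rng(seed)
    return lambda n: rng.uniform(low, high, size=n)

# Pass the callable exactly like the lambda or the ufoo() instance
W = sp.random(1024, 1024, density=0.01, format='csc',
              data_rvs=make_uniform_rvs(seed=0))
print(W.data.min(), W.data.max())
```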
If you need the mean to be exactly zero (within roundoff of course), this can be done by subtracting the mean from the non-zero values, as I mentioned above:
W.data -= np.mean(W.data)
Then (with idx = W.nonzero(), the indices of the non-zero entries):
W[idx].mean()
-2.3718641632430623e-18
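To check that this centering touches only the stored values and leaves the sparsity pattern alone, one can compare the number of stored entries before and after. A small sketch, using a smaller matrix than above purely to keep it quick:

```python
import numpy as np
import scipy.sparse as sp

W = sp.random(200, 200, density=0.05, format='csc',
              data_rvs=lambda n: np.random.uniform(-0.5, 0.5, size=n))
nnz_before = W.nnz

# Subtract the mean of the stored values; implicit zeros are not touched
W.data -= W.data.mean()

print(W.nnz == nnz_before)   # number of stored entries is unchanged
print(W.data.mean())         # zero up to floating-point roundoff
```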
Upvotes: 1
Reputation: 33542
In my opinion, your requirements are still incomplete (see disadvantage mentioned below).
Here is an implementation of the simple construction outlined in my comment above:
import numpy as np
import scipy.sparse as sp
M, N, NNZ = 5, 5, 10
assert NNZ % 2 == 0
flat_dim = M*N
valuesA = np.random.uniform(-0.5, 0.5, size=NNZ // 2)
valuesB = valuesA * -1
values = np.hstack((valuesA, valuesB))
positions_flat = np.random.choice(flat_dim, size=NNZ, replace=False)
positions_2d = np.unravel_index(positions_flat, (M, N))  # shape passed positionally; the 'dims' keyword was removed from newer NumPy
mat = sp.coo_matrix((values, (positions_2d[0], positions_2d[1])), shape=(M, N))
print(mat.todense())
print(mat.data.mean())
Output:
[[ 0. 0. 0. 0.0273862 0. ]
[-0.3943963 0. 0. -0.04134932 0. ]
[-0.10121743 0. -0.0273862 0. 0.04134932]
[ 0.3943963 0. 0. 0. 0. ]
[-0.24680983 0. 0.24680983 0.10121743 0. ]]
0.0
Now, regarding that linked problem: I'm guessing here, but I would not be surprised if sampling x values uniformly with the constraint mean(x)=0 were NP-hard.
Keep in mind that a-posteriori centering of the nonzeros, as recommended in the other answer, changes the underlying distribution (even for simple distributions). In some cases it even invalidates the bounds (values leave the interval [-0.5, 0.5]).
This means the question is really about formalizing how important each objective is and balancing them in some way.
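The bounds issue is easy to see on a tiny hand-picked sample. The four values below are illustrative, not drawn from any of the code above:

```python
import numpy as np

# All four values lie inside [-0.5, 0.5]; their mean is 0.0275
values = np.array([-0.49, 0.10, 0.20, 0.30])

# Centering after the fact shifts every value down by the sample mean
centered = values - values.mean()

print(centered.mean())  # zero up to roundoff
print(centered.min())   # about -0.5175, i.e. below the lower bound -0.5
```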
Upvotes: 0
Reputation: 231738
sparse.random does 2 things: it distributes nonzeros randomly, and it generates random uniform values.
In [62]: M = sparse.random(10,10,density=.2, format='csr')
In [63]: M
Out[63]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
In [64]: M.data
Out[64]:
array([ 0.42825407, 0.51858978, 0.8084335 , 0.08691635, 0.13210409,
0.61288928, 0.39675205, 0.58242891, 0.5174367 , 0.57859824,
0.48812484, 0.13472883, 0.82992478, 0.70568697, 0.45001632,
0.52147305, 0.72943809, 0.55801913, 0.97018861, 0.83236235])
You can modify the data values cheaply without changing the sparsity distribution:
In [65]: M.data -= 0.5
In [66]: M.toarray()
Out[66]:
array([[ 0. , 0. , 0. , -0.07174593, 0. ,
0. , 0. , 0. , 0. , 0. ],
[ 0.01858978, 0. , 0. , 0.3084335 , -0.41308365,
0. , 0. , 0. , 0. , -0.36789591],
[ 0. , 0. , 0. , 0. , 0.11288928,
-0.10324795, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.08242891, 0.0174367 , 0. ],
[ 0. , 0. , 0.07859824, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. , -0.01187516, 0. , 0. , -0.36527117],
[ 0. , 0. , 0.32992478, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.20568697,
0. , 0. , -0.04998368, 0. , 0. ],
[ 0.02147305, 0. , 0.22943809, 0.05801913, 0. ,
0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.47018861, 0.33236235, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ]])
In [67]: np.mean(M.data)
Out[67]: 0.044118297661574338
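The two steps above can be wrapped into one function. This is just a sketch; the name centered_sparse_uniform is made up for illustration, and it relies on the default values from sparse.random being uniform on [0, 1):

```python
import numpy as np
from scipy import sparse

def centered_sparse_uniform(m, n, density, fmt='csr'):
    """Hypothetical helper: sparse matrix with stored values uniform on [-0.5, 0.5)."""
    M = sparse.random(m, n, density=density, format=fmt)  # stored values in [0, 1)
    M.data -= 0.5   # shift only the stored entries; the pattern is unchanged
    return M

M = centered_sparse_uniform(100, 100, density=0.05)
print(M.nnz, M.data.min(), M.data.max(), M.data.mean())
```

As in Out[67], the sample mean of the stored values is close to zero but not exactly zero.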
Alternatively, replace the nonzero values with a new set of values:
In [69]: M.data = np.random.randint(-5,5,20)
In [70]: M
Out[70]:
<10x10 sparse matrix of type '<class 'numpy.int32'>'
with 20 stored elements in Compressed Sparse Row format>
In [71]: M.toarray()
Out[71]:
array([[ 0, 0, 0, 4, 0, 0, 0, 0, 0, 0],
[-1, 0, 0, 1, 2, 0, 0, 0, 0, -4],
[ 0, 0, 0, 0, 0, 4, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, -5, -5, 0],
[ 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, -3, 0, 0, 3],
[ 0, 0, -1, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, -4, 0, 0, -1, 0, 0],
[-1, 0, -5, -2, 0, 0, 0, 0, 0, 0],
[ 0, 3, 1, 0, 0, 0, 0, 0, 0, 0]])
In [72]: M.data
Out[72]:
array([ 4, -1, 1, 2, -4, 0, 4, -5, -5, 2, -3, 3, -1, -4, -1, -1, -5,
-2, 3, 1])
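Note that M.data in Out[72] contains a 0: replacement values that happen to be zero stay stored as explicit entries. If that matters, eliminate_zeros() drops them. A small sketch on a hand-built matrix (the values are illustrative):

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 3.0, 0.0]]))

# Replace the three stored values; one of the new values is zero
M.data = np.array([4.0, 0.0, -5.0])
stored_before = M.nnz      # the explicit zero still counts as stored

M.eliminate_zeros()        # remove explicitly stored zeros in place
stored_after = M.nnz

print(stored_before, stored_after)
print(M.toarray())
```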
Upvotes: 1