Reputation: 11
I am using the Python version of the Shogun Toolbox.
I want to use the LinearTimeMMD, which accepts data under the streaming interface CStreamingFeatures. I have the data in the form of two RealFeatures objects: feat_p and feat_q. These work just fine with the QuadraticTimeMMD.
In order to use them with the LinearTimeMMD, I need to create StreamingFeatures objects from them. In this case, these would be StreamingRealFeatures, as far as I know.
My first approach was using this:
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
This, however, does not seem to work: the LinearTimeMMD emits warnings and returns an unrealistic result (growing constantly with the number of samples), and calling gen_p.get_dim_feature_space() returns -1. Also, calling gen_p.get_streamed_features(100) results in a memory access error.
I tried another approach using StreamingFileFromFeatures:
streamFile_p = sg.StreamingFileFromRealFeatures()
streamFile_p.set_features(feat_p)
streamFile_q = sg.StreamingFileFromRealFeatures()
streamFile_q.set_features(feat_q)
gen_p = StreamingRealFeatures(streamFile_p, False, 100)
gen_q = StreamingRealFeatures(streamFile_q, False, 100)
But this results in the same situation, with the same problems described above.
It seems that in both cases, the contents of the RealFeatures object handed to the StreamingRealFeatures object cannot be accessed.
What am I doing wrong?
EDIT: I was asked for a small working example to show the error:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import laplace, norm
def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)):
    # sample from both distributions
    X=norm.rvs(size=n)*np.sqrt(sigma2)+mu
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    return X,Y
# Main Script
mu=0.0
sigma2=1
b=np.sqrt(0.5)
n=220
X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)
# turn data into Shogun representation (columns vectors)
feat_p=sg.RealFeatures(X.reshape(1,len(X)))
feat_q=sg.RealFeatures(Y.reshape(1,len(Y)))
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
print("Dimensions: ", gen_p.get_dim_feature_space())
print("Number of features: ", gen_p.get_num_features())
print("Number of vectors: ", gen_p.get_num_vectors())
test_features = gen_p.get_streamed_features(1)
print("success")
EDIT 2: The Output of the working example:
Dimensions: -1
Number of features: -1
Number of vectors: 1
Segmentation fault (core dumped)
EDIT 3: Additional Code with LinearTimeMMD using the RealFeatures directly.
mmd = sg.LinearTimeMMD()
kernel = sg.GaussianKernel(10, 1)
mmd.set_kernel(kernel)
mmd.set_p(feat_p)
mmd.set_q(feat_q)
mmd.set_num_samples_p(1000)
mmd.set_num_samples_q(1000)
alpha = 0.05
# Code taken from notebook example on
# http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html
# Location on page: In[16]
block_size=100
mmd.set_num_blocks_per_burst(block_size)
# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print("MMD_l[X,Y]^2=%.2f" % statistic)
EDIT 4: Additional code sample showing the growing mmd problem:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
def mmd(n):
    X = [(1.0,i) for i in range(n)]
    Y = [(2.0,i) for i in range(n)]
    X = np.array(X)
    Y = np.array(Y)
    # turn data into Shogun representation (columns vectors)
    feat_p=sg.RealFeatures(X.reshape(2, len(X)))
    feat_q=sg.RealFeatures(Y.reshape(2, len(Y)))
    mmd = sg.LinearTimeMMD()
    kernel = sg.GaussianKernel(10, 1)
    mmd.set_kernel(kernel)
    mmd.set_p(feat_p)
    mmd.set_q(feat_q)
    mmd.set_num_samples_p(100)
    mmd.set_num_samples_q(100)
    alpha = 0.05
    block_size=100
    mmd.set_num_blocks_per_burst(block_size)
    # compute an unbiased estimate in linear time
    statistic=mmd.compute_statistic()
    print("N =", n)
    print("MMD_l[X,Y]^2=%.2f" % statistic)
    print()

for n in [1000, 10000, 15000, 20000, 25000, 30000]:
    mmd(n)
Output:
N = 1000
MMD_l[X,Y]^2=-12.69
N = 10000
MMD_l[X,Y]^2=-40.14
N = 15000
MMD_l[X,Y]^2=-49.16
N = 20000
MMD_l[X,Y]^2=-56.77
N = 25000
MMD_l[X,Y]^2=-63.47
N = 30000
MMD_l[X,Y]^2=-69.52
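A side observation on the EDIT 4 snippet, separate from the streaming question: X is built with shape (n, 2), and X.reshape(2, len(X)) refills the values in row-major order instead of transposing them, so the two feature rows end up interleaved. The sketch below imitates that behaviour with plain Python lists. The helpers reshape_row_major and transpose are hypothetical stand-ins written for illustration (they are not NumPy or Shogun functions); in NumPy the transpose would be X.T.

```python
def reshape_row_major(rows, n_rows, n_cols):
    """Flatten row by row, then cut into n_rows chunks of n_cols values.

    Hypothetical helper mimicking a default (C-order) reshape.
    """
    flat = [value for row in rows for value in row]
    if len(flat) != n_rows * n_cols:
        raise ValueError("total number of elements must stay the same")
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def transpose(rows):
    """Swap rows and columns, so each original row becomes a column."""
    return [list(column) for column in zip(*rows)]


n = 4
X = [(1.0, float(i)) for i in range(n)]  # same construction as in EDIT 4

reshaped = reshape_row_major(X, 2, n)
transposed = transpose(X)

print("reshape  :", reshaped)    # constant and index values are interleaved
print("transpose:", transposed)  # row 0 is the constant, row 1 is the index
```

If the intent is one two-dimensional sample per column, the transposed layout is the one that keeps each (1.0, i) pair together.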
Reputation: 315
For some reason, the Python environment on my machine is broken, so I can't give a snippet in Python. But let me point to a working example in C++ that attempts to address the issues (https://gist.github.com/lambday/983830beb0afeb38b9447fd91a143e67).
You can create a StreamingRealFeatures instance directly from a RealFeatures instance (like you tried the first time). Check the test1() and test2() methods in the gist, which show the equivalence of using RealFeatures and StreamingRealFeatures in the use case in question. The reason you were getting weird results when streaming directly is that, in order to start the streaming process, we need to call the start_parser method of the StreamingRealFeatures class. We handle these technicalities internally inside the MMD classes, but when using the streaming features directly, we need to invoke it separately (see the test3() method in my attached example).
The compute_statistic() method doesn't return the MMD directly, but rather returns $\frac{n_x \times n_y}{n_x + n_y} \times MMD^2$ (as mentioned in the doc http://shogun.ml/api/latest/classshogun_1_1CMMD.html). With that in mind, maybe the results you are getting for varying numbers of samples make sense.
Hope it helps.
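To make the scaling point concrete, here is a minimal pure-Python sketch, not Shogun code. It assumes the standard linear-time estimator, which averages h = k(x1,x2) + k(y1,y2) - k(x1,y2) - k(x2,y1) over disjoint pairs of samples, and a Gaussian kernel of the form exp(-||x-y||^2 / width). All function names and the width parameter are hypothetical and chosen for illustration only.

```python
import math


def gaussian_kernel(x, y, width=1.0):
    """k(x, y) = exp(-||x - y||^2 / width); the kernel form is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / width)


def linear_time_mmd2(xs, ys, width=1.0):
    """Linear-time estimate of MMD^2, averaged over disjoint sample pairs."""
    m = min(len(xs), len(ys)) // 2
    if m == 0:
        raise ValueError("need at least two samples from each distribution")
    total = 0.0
    for i in range(m):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += (gaussian_kernel(x1, x2, width) + gaussian_kernel(y1, y2, width)
                  - gaussian_kernel(x1, y2, width) - gaussian_kernel(x2, y1, width))
    return total / m


def scaled_statistic(mmd2, n_x, n_y):
    """Apply the factor n_x * n_y / (n_x + n_y) described in the answer."""
    return n_x * n_y / (n_x + n_y) * mmd2


def unscale_statistic(statistic, n_x, n_y):
    """Recover MMD^2 from a statistic scaled by n_x * n_y / (n_x + n_y)."""
    return statistic * (n_x + n_y) / (n_x * n_y)


for n in (1000, 10000):
    xs = [(1.0,)] * n  # toy one-dimensional data, constant per distribution
    ys = [(2.0,)] * n
    mmd2 = linear_time_mmd2(xs, ys)
    print("n =", n, "MMD^2 =", round(mmd2, 4),
          "scaled =", round(scaled_statistic(mmd2, n, n), 2))
```

On this toy data the unscaled estimate is the same for both sample sizes, while the scaled value grows in proportion to n. Dividing a reported statistic by the same factor, as unscale_statistic does, gives a number that can be compared across sample sizes.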