Python/Shogun Toolbox: Convert RealFeatures to StreamingRealFeatures

Question

I am using the Python version of the Shogun Toolbox. I want to use the LinearTimeMMD, which accepts data under the streaming interface CStreamingFeatures. I have the data in the form of two RealFeatures objects: feat_p and feat_q. These work just fine with the QuadraticTimeMMD.

To use them with the LinearTimeMMD, I need to create streaming feature objects from them. As far as I know, these would be StreamingRealFeatures.

My first approach was using this:

gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)

However, this does not seem to work: the LinearTimeMMD emits warnings and returns an unrealistic result (one that keeps growing with the number of samples), and calling gen_p.get_dim_feature_space() returns -1. In addition, calling gen_p.get_streamed_features(100) results in a memory access error (segmentation fault).

I tried another approach using StreamingFileFromRealFeatures:

streamFile_p = sg.StreamingFileFromRealFeatures()
streamFile_p.set_features(feat_p)
streamFile_q = sg.StreamingFileFromRealFeatures()
streamFile_q.set_features(feat_q)

gen_p = StreamingRealFeatures(streamFile_p, False, 100)
gen_q = StreamingRealFeatures(streamFile_q, False, 100)

But this leads to the same problems described above. It seems that in both cases, the contents of the RealFeatures object handed to the StreamingRealFeatures object cannot be accessed. What am I doing wrong?

EDIT: I was asked for a small working example to show the error:

import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np

from matplotlib import pyplot as plt

from scipy.stats import laplace, norm

def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)):    
    # sample from both distributions
    X=norm.rvs(size=n)*np.sqrt(sigma2)+mu
    Y=laplace.rvs(size=n, loc=mu, scale=b)

    return X,Y


# Main Script
mu=0.0
sigma2=1
b=np.sqrt(0.5)
n=220
X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)

# turn data into Shogun representation (column vectors)
feat_p=sg.RealFeatures(X.reshape(1,len(X)))
feat_q=sg.RealFeatures(Y.reshape(1,len(Y)))

gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)

print("Dimensions: ", gen_p.get_dim_feature_space())
print("Number of features: ", gen_p.get_num_features())
print("Number of vectors: ", gen_p.get_num_vectors())

test_features = gen_p.get_streamed_features(1)

print("success")

EDIT 2: Output of the working example:

Dimensions:  -1
Number of features:  -1
Number of vectors:  1
Segmentation fault (core dumped)

EDIT 3: Additional Code with LinearTimeMMD using the RealFeatures directly.

mmd = sg.LinearTimeMMD()
kernel = sg.GaussianKernel(10, 1)
mmd.set_kernel(kernel)
mmd.set_p(feat_p)
mmd.set_q(feat_q)
mmd.set_num_samples_p(1000)
mmd.set_num_samples_q(1000)
alpha = 0.05

# Code taken from notebook example on
# http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html
# Location on page: In[16]

block_size=100
mmd.set_num_blocks_per_burst(block_size)

# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print("MMD_l[X,Y]^2=%.2f" % statistic)
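
For comparison, here is a minimal plain-NumPy sketch of the textbook linear-time MMD^2 estimate with a Gaussian kernel. It does not use Shogun at all. The helper names (gaussian_kernel, linear_time_mmd2) are hypothetical, and the kernel parametrisation exp(-||a - b||^2 / width) is an assumption that may not match Shogun's GaussianKernel. Each averaged term combines four kernel values in (0, 1], so an unnormalised estimate of this kind has to lie between -2 and 2.

```python
import numpy as np

def gaussian_kernel(a, b, width=1.0):
    """Row-wise Gaussian kernel exp(-||a - b||^2 / width).

    The parametrisation is an assumption made for this sketch.
    """
    sq_dist = np.sum((a - b) ** 2, axis=1)
    return np.exp(-sq_dist / width)

def linear_time_mmd2(x, y, width=1.0):
    """Unnormalised linear-time MMD^2 estimate for samples of shape (n, d)."""
    # Use an even number of samples so they can be split into disjoint pairs.
    m = (min(len(x), len(y)) // 2) * 2
    x1, x2 = x[0:m:2], x[1:m:2]
    y1, y2 = y[0:m:2], y[1:m:2]
    # h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1), averaged over pairs.
    h = (gaussian_kernel(x1, x2, width) + gaussian_kernel(y1, y2, width)
         - gaussian_kernel(x1, y2, width) - gaussian_kernel(x2, y1, width))
    return h.mean()

rng = np.random.default_rng(0)
for n in [1000, 10000, 30000]:
    # Gaussian vs. Laplace with matching mean and variance, one sample per row.
    x = rng.normal(size=(n, 1))
    y = rng.laplace(scale=np.sqrt(0.5), size=(n, 1))
    print("N =", n, "MMD_l^2 =", linear_time_mmd2(x, y))
```

A value far outside [-2, 2] from a library would therefore point either to a sample-size-dependent normalisation of the returned statistic or to a problem in how the data reach the estimator. This sketch cannot tell which of the two applies to Shogun.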

EDIT 4: Additional code sample showing the growing mmd problem:

import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np

from matplotlib import pyplot as plt

def mmd(n):

    X = [(1.0,i) for i in range(n)]
    Y = [(2.0,i) for i in range(n)]

    X = np.array(X)
    Y = np.array(Y)

    # turn data into Shogun representation (column vectors)
    feat_p=sg.RealFeatures(X.reshape(2, len(X)))
    feat_q=sg.RealFeatures(Y.reshape(2, len(Y)))

    mmd = sg.LinearTimeMMD()
    kernel = sg.GaussianKernel(10, 1)
    mmd.set_kernel(kernel)
    mmd.set_p(feat_p)
    mmd.set_q(feat_q)
    mmd.set_num_samples_p(100)
    mmd.set_num_samples_q(100)
    alpha = 0.05
    block_size=100
    mmd.set_num_blocks_per_burst(block_size)

    # compute an unbiased estimate in linear time
    statistic=mmd.compute_statistic()
    print("N =", n)
    print("MMD_l[X,Y]^2=%.2f" % statistic)
    print()

for n in [1000, 10000, 15000, 20000, 25000, 30000]:
    mmd(n)

Output:

N = 1000
MMD_l[X,Y]^2=-12.69

N = 10000
MMD_l[X,Y]^2=-40.14

N = 15000
MMD_l[X,Y]^2=-49.16

N = 20000
MMD_l[X,Y]^2=-56.77

N = 25000
MMD_l[X,Y]^2=-63.47

N = 30000
MMD_l[X,Y]^2=-69.52
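
One thing that may be worth checking in the EDIT 4 code: X and Y are built with shape (n, 2), and reshape(2, n) is not a transpose. It reinterprets the flat buffer in row-major order, so the two coordinates get interleaved within each row. Whether this contributes to the values above is untested. The sketch below uses only NumPy and a small n to show the difference.

```python
import numpy as np

n = 4
# Same construction as in the example: one sample per row, shape (n, 2).
X = np.array([(1.0, i) for i in range(n)])

reshaped = X.reshape(2, n)   # reinterprets the flat buffer, mixes coordinates
transposed = X.T             # shape (2, n): one sample per column

print(reshaped)
# [[1. 0. 1. 1.]
#  [1. 2. 1. 3.]]
print(transposed)
# [[1. 1. 1. 1.]
#  [0. 1. 2. 3.]]
```

If the intent is one sample per column with the constant in the first row and the index in the second, X.T gives that layout, while X.reshape(2, len(X)) does not.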
