Textilienhersteller

Reputation: 11

Python/Shogun Toolbox: Convert RealFeatures to StreamingRealFeatures

I am using the Python version of the Shogun Toolbox. I want to use the LinearTimeMMD, which accepts data under the streaming interface CStreamingFeatures. I have the data in the form of two RealFeatures objects: feat_p and feat_q. These work just fine with the QuadraticTimeMMD.

To use the data with LinearTimeMMD, I need to create StreamingFeatures objects from these. In this case that would be StreamingRealFeatures, as far as I know.

My first approach was using this:

gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)

This, however, does not seem to work: LinearTimeMMD emits warnings and returns an unrealistic result (it grows steadily with the number of samples), and gen_p.get_dim_feature_space() returns -1. Calling gen_p.get_streamed_features(100) also causes a memory access error.

I tried another approach using StreamingFileFromFeatures:

streamFile_p = sg.StreamingFileFromRealFeatures()
streamFile_p.set_features(feat_p)
streamFile_q = sg.StreamingFileFromRealFeatures()
streamFile_q.set_features(feat_q)

gen_p = StreamingRealFeatures(streamFile_p, False, 100)
gen_q = StreamingRealFeatures(streamFile_q, False, 100)

But this leads to the same problems. In both cases it seems that the contents of the RealFeatures object handed to the StreamingRealFeatures object cannot be accessed. What am I doing wrong?

EDIT: I was asked for a small working example to show the error:

import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np

from matplotlib import pyplot as plt

from scipy.stats import laplace, norm

def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)):    
    # sample from both distributions
    X=norm.rvs(size=n)*np.sqrt(sigma2)+mu
    Y=laplace.rvs(size=n, loc=mu, scale=b)

    return X,Y


# Main Script
mu=0.0
sigma2=1
b=np.sqrt(0.5)
n=220
X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)

# turn data into Shogun representation (column vectors)
feat_p=sg.RealFeatures(X.reshape(1,len(X)))
feat_q=sg.RealFeatures(Y.reshape(1,len(Y)))

gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)

print("Dimensions: ", gen_p.get_dim_feature_space())
print("Number of features: ", gen_p.get_num_features())
print("Number of vectors: ", gen_p.get_num_vectors())

test_features = gen_p.get_streamed_features(1)

print("success")

EDIT 2: The Output of the working example:

Dimensions:  -1
Number of features:  -1
Number of vectors:  1
Segmentation fault (core dumped)

EDIT 3: Additional Code with LinearTimeMMD using the RealFeatures directly.

mmd = sg.LinearTimeMMD()
kernel = sg.GaussianKernel(10, 1)
mmd.set_kernel(kernel)
mmd.set_p(feat_p)
mmd.set_q(feat_q)
mmd.set_num_samples_p(1000)
mmd.set_num_samples_q(1000)
alpha = 0.05

# Code taken from notebook example on
# http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html
# Location on page: In[16]

block_size=100
mmd.set_num_blocks_per_burst(block_size)

# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print("MMD_l[X,Y]^2=%.2f" % statistic)

EDIT 4: Additional code sample showing the growing MMD problem:

import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np

from matplotlib import pyplot as plt

def mmd(n):

    X = [(1.0,i) for i in range(n)]
    Y = [(2.0,i) for i in range(n)]

    X = np.array(X)
    Y = np.array(Y)

    # turn data into Shogun representation (column vectors)
    feat_p=sg.RealFeatures(X.reshape(2, len(X)))
    feat_q=sg.RealFeatures(Y.reshape(2, len(Y)))

    mmd = sg.LinearTimeMMD()
    kernel = sg.GaussianKernel(10, 1)
    mmd.set_kernel(kernel)
    mmd.set_p(feat_p)
    mmd.set_q(feat_q)
    mmd.set_num_samples_p(100)
    mmd.set_num_samples_q(100)
    alpha = 0.05
    block_size=100
    mmd.set_num_blocks_per_burst(block_size)

    # compute an unbiased estimate in linear time
    statistic=mmd.compute_statistic()
    print("N =", n)
    print("MMD_l[X,Y]^2=%.2f" % statistic)
    print()

for n in [1000, 10000, 15000, 20000, 25000, 30000]:
    mmd(n)

Output:

N = 1000
MMD_l[X,Y]^2=-12.69

N = 10000
MMD_l[X,Y]^2=-40.14

N = 15000
MMD_l[X,Y]^2=-49.16

N = 20000
MMD_l[X,Y]^2=-56.77

N = 25000
MMD_l[X,Y]^2=-63.47

N = 30000
MMD_l[X,Y]^2=-69.52

Upvotes: 0

Views: 422

Answers (1)

Soumyajit De

Reputation: 315

The Python environment on my machine is broken for some reason, so I can't provide a Python snippet. Instead, here is a working C++ example that addresses the issues: https://gist.github.com/lambday/983830beb0afeb38b9447fd91a143e67

  • I think the easiest way is to create a StreamingRealFeatures instance directly from a RealFeatures instance, as in your first attempt. See the test1() and test2() methods in the gist, which show that RealFeatures and StreamingRealFeatures are equivalent for this use case. You got strange results when streaming directly because the streaming process only starts once the start_parser method of the StreamingRealFeatures class has been called. The MMD classes handle this internally, but when you use the streaming features directly you have to call it yourself (see the test3() method in the gist).
  • Please note that compute_statistic() does not return MMD directly. It returns \frac{n_x\times n_y}{n_x+n_y}\times MMD^2, as stated in the documentation (http://shogun.ml/api/latest/classshogun_1_1CMMD.html). For n_x = n_y = n this factor is n/2, so the returned value scales with the sample size. With that in mind, the results you are getting for a varying number of samples may make sense.
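
For readers who want a Python counterpart, the two points above could be sketched roughly as follows. This is only an illustration, not a translation of the gist: stream_block and unscale_statistic are hypothetical helper names, the end_parser() call is assumed to be the counterpart of start_parser(), and the code has not been run against a Shogun installation. It assumes Python 3.

```python
def stream_block(streaming_feats, num_vectors):
    """Fetch num_vectors examples from a streaming-features object.

    Assumption: the object exposes start_parser(), get_streamed_features()
    and end_parser(), as StreamingRealFeatures is expected to.
    """
    # Streaming has to be started explicitly when the features are used
    # outside the MMD classes.
    streaming_feats.start_parser()
    try:
        return streaming_feats.get_streamed_features(num_vectors)
    finally:
        # Assumed counterpart of start_parser(); stops the parser again.
        streaming_feats.end_parser()


def unscale_statistic(statistic, n_x, n_y):
    """Recover MMD^2 from a statistic equal to n_x*n_y/(n_x+n_y) * MMD^2."""
    factor = (n_x * n_y) / (n_x + n_y)
    return statistic / factor


# Hypothetical usage with Shogun installed (feat_p and mmd as in the question):
#   gen_p = StreamingRealFeatures(feat_p)
#   block = stream_block(gen_p, 100)
#   mmd2 = unscale_statistic(mmd.compute_statistic(), 100, 100)
```

Dividing out the factor makes statistics computed with different sample counts comparable. With n_x = n_y = 100 the factor is 50, so a returned value of 50.0 corresponds to MMD^2 = 1.0.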

Hope it helps.

Upvotes: 1