Reputation: 180
I'm using Shogun to run MMD (quadratic) and compare two nonparametric distributions based on their samples (code below is for 1D, but I've also looked at 2D samples). In the toy problem shown below, I try to change the ratio between training and testing samples in the process of selecting an optimized kernel (KSM_MAXIMIZE_MMD is the selection strategy; I've also used KSM_MEDIAN_HEURISTIC). It appears that any ratio other than 1 yields an error.
Am I allowed to change this ratio in this setting? (I see that it is used at: http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html, but it is set to 1 there)
Concise version of the my code (inspired by the notebook available at: http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html):
import shogun as sg
import numpy as np
from scipy.stats import laplace, norm
n = 220
mu = 0.0
sigma2 = 1
b=np.sqrt(0.5)
X = sg.RealFeatures((norm.rvs(size=n) * np.sqrt(sigma2) + mu).reshape(1,-1))
Y = sg.RealFeatures(laplace.rvs(size=n, loc=mu, scale=b).reshape(1,-1))
mmd = sg.QuadraticTimeMMD(X, Y)
mmd.add_kernel(sg.GaussianKernel(10, 1.0))
mmd.set_kernel_selection_strategy(sg.KSM_MAXIMIZE_MMD)
mmd.set_train_test_mode(True)
mmd.set_train_test_ratio(1)
mmd.select_kernel()
mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
kernel_width = mmd_kernel.get_width()
statistic = mmd.compute_statistic()
p_value = mmd.compute_p_value(statistic)
print p_value
This exact version runs and prints p-values just fine.
If I change the argument passed to mmd.set_train_test_ratio()
from 1 to 2, I get:
SystemErrorTraceback (most recent call last)
<ipython-input-30-dd5fcb933287> in <module>()
25 kernel_width = mmd_kernel.get_width()
26
---> 27 statistic = mmd.compute_statistic()
28 p_value = mmd.compute_p_value(statistic)
29
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90: assertion kernel_matrix.num_rows==size && kernel_matrix.num_cols==size failed in float32_t shogun::internal::mmd::ComputeMMD::operator()(const shogun::SGMatrix<T>&) const [with T = float; float32_t = float] file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90
It gets worse, if I use the value below 1. In addition to the following error, jupyter notebook kernel crashes every time (after which I need to rerun the entire notebook; the message says: "The kernel appears to have died. It will restart automatically.").
SystemErrorTraceback (most recent call last)
<ipython-input-31-cb4a5224f4ef> in <module>()
20 mmd.set_train_test_ratio(0.5)
21
---> 22 mmd.select_kernel()
23
24 mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/kernel/Kernel.h line 210: GaussianKernel::kernel(): index out of Range: idx_a=146/146 idx_b=0/146
Complete code (in a jypyter notebook) can be found at: http://nbviewer.jupyter.org/url/dmitry.duplyakin.org/p/jn/kernel-minimal.ipynb
Please let me know if I am missing a step or need to try a different approach.
Side questions:
sg.GaussianKernel(10, <width>)
. I couldn't find more information about the 1st parameter other than its name, cache size. How and when am I supposed to change it?mmd.get_kernel_selection_strategy().get_name()
returns only the generic name, specifically KernelSelectionStrategy
. How can I obtain a more specific name for the selected strategy (e.g., KSM_MEDIAN_HEURISTIC
) from an instance of the sg.QuadraticTimeMMD class?Any relevant information or references will be greatly appreciated.
Shogun version: v6.1.3_2017-12-7_19:14
Upvotes: 1
Views: 301
Reputation: 180
Summary (from comments):
Upvotes: 0
Reputation: 315
The train_test_ratio
attribute is the ratio between the number of samples used in training and the number of samples used in testing. When you have train_test_mode
turned on, the way it decides how many samples to fetch in each mode goes something like this.
num_training_samples = m_num_samples * train_test_ratio / (train_test_ratio + 1)
num_testing_samples = m_num_samples / (train_test_ratio + 1)
It implicitly assumes the divisibility. A train_test_ratio
of 2 would, therefore, try to use 2/3rd of the data for training and 1/3rd of the data for testing, which is problematic for the total number of samples you have, 220. By the logic, it sets num_training_samples
= 146 and num_testing_samples
= 73, which doesn't add up to 220. Similar issues arise when using 0.5 as the train-test ratio. If you use some other values for the train_test_ratio
which splits the total number of samples perfectly, I think these errors would go away.
I am not totally sure but I think the cache makes sense when you're using SVMLight with Shogun. Please check http://svmlight.joachims.org/ for details. From their page
-m [5..] - size of cache for kernel evaluations in MB (default 40)
The larger the faster...
There's no pretty-print for the kernel-selection strategy being used, but you could do mmd.get_kernel_selection_strategy().get_method()
which returns you the enum value (of type EKernelSelectionMethod) which might be helpful. Since it's not documented yet in Shogun api-doc, here's the C++ equivalent for this that you might use.
enum EKernelSelectionMethod
{
KSM_MEDIAN_HEURISTIC,
KSM_MAXIMIZE_MMD,
KSM_MAXIMIZE_POWER,
KSM_CROSS_VALIDATION,
KSM_AUTO = KSM_MAXIMIZE_POWER
};
Upvotes: 1