Xenthor
Xenthor

Reputation: 443

Python - Earth Movers Distance

I want to use the Earth Movers Distance to compare multiple images.
I compared scipy.stats.wasserstein_distance() with pyemd.emd_samples(). As far as I understood the wasserstein_distance() takes two distributions, i.e. histograms, while the emd_samples() takes an 1D array of values and calculates the histograms for you.
Given that both methods use the same histograms, they should provide the same or at least similar results.
The problem is, that both methods provide highly different results. But, if I pass a flattened version of my images in both methods, the results are very similar.

Is there a mistake on my side or is there a problem with one of these implementations?

cat1 = skimage.io.imread("./cat1.jpg", as_grey=True).flatten().astype('float64')
cat2 = skimage.io.imread("./cat2.jpg", as_grey=True).flatten().astype('float64')
shuttle = skimage.io.imread("./shuttle.jpg", as_grey=True).flatten().astype('float64')

emd_s = np.array([[emd_samples(cat1, cat1, bins="fd"), emd_samples(cat1, cat2, bins="fd"), emd_samples(cat1, shuttle, bins="fd")],
                  [emd_samples(cat2, cat1, bins="fd"), emd_samples(cat2, cat2, bins="fd"), emd_samples(cat2, shuttle, bins="fd")],
                  [emd_samples(shuttle, cat1, bins="fd"), emd_samples(shuttle, cat2, bins="fd"), emd_samples(shuttle, shuttle, bins="fd")]])

pmf_cat1, bins_cat1 = np.histogram(cat1 , bins="fd")
pmf_cat2, bins_cat2 = np.histogram(cat2 , bins="fd")
pmf_shuttle, bins_shuttle = np.histogram(shuttle , bins="fd")

emd_s2 = np.array([[emd_samples(pmf_cat1, pmf_cat1, bins="fd"), emd_samples(pmf_cat1, pmf_cat2, bins="fd"), emd_samples(pmf_cat1, pmf_shuttle, bins="fd")],
                  [emd_samples(pmf_cat2, pmf_cat1, bins="fd"), emd_samples(pmf_cat2, pmf_cat2, bins="fd"), emd_samples(pmf_cat2, pmf_shuttle, bins="fd")],
                  [emd_samples(pmf_shuttle, pmf_cat1, bins="fd"), emd_samples(pmf_shuttle, pmf_cat2, bins="fd"), emd_samples(pmf_shuttle, pmf_shuttle, bins="fd")]])

swd = np.array([[wasserstein_distance(pmf_cat1, pmf_cat1), wasserstein_distance(pmf_cat1, pmf_cat2), wasserstein_distance(pmf_cat1, pmf_shuttle)],
                [wasserstein_distance(pmf_cat2, pmf_cat1), wasserstein_distance(pmf_cat2, pmf_cat2), wasserstein_distance(pmf_cat2, pmf_shuttle)],
                [wasserstein_distance(pmf_shuttle, pmf_cat1), wasserstein_distance(pmf_shuttle, pmf_cat2), wasserstein_distance(pmf_shuttle, pmf_shuttle)]])

swd2 = np.array([[wasserstein_distance(cat1, cat1), wasserstein_distance(cat1, cat2), wasserstein_distance(cat1, shuttle)],
                [wasserstein_distance(cat2, cat1), wasserstein_distance(cat2, cat2), wasserstein_distance(cat2, shuttle)],
                [wasserstein_distance(shuttle, cat1), wasserstein_distance(shuttle, cat2), wasserstein_distance(shuttle, shuttle)]])

The above example results in similar results for emd_s and swd2 and somehow similar results for emd_s2 and swd, although the last pair is still quite different because technically emd_samples should make a histogram based on histograms in this case.

Upvotes: 3

Views: 3702

Answers (1)

eMCeeEs
eMCeeEs

Reputation: 29

I came across a similar problem and like to note a couple of things here.

  1. Both function emd_samples and wasserstein_distance take the values observed in an (empirical) distribution as input and NOT the distribution itself.

  2. The function emd allows you to pass distributions, however, you need to provide the metric as an additional parameter. Also, when working with histograms as (density) distributions you need to normalize them.

  3. Not flattening the grey-scale images means that you compare 2D-histgramms which only works with pyemd.

Example usage:

import numpy as np
import skimage
import os

from pyemd import emd, emd_samples
from scipy.stats import wasserstein_distance

# get some test images
img1 = skimage.io.imread(os.path.join(skimage.data_dir, 'astronaut.png'))
img2 = skimage.io.imread(os.path.join(skimage.data_dir, 'camera.png'))
img3 = skimage.io.imread(os.path.join(skimage.data_dir, 'horse.png'))

# flatten them
images = [img.ravel() for img in [img1, img2, img3]]

# compute EMD using values
emd_samples(images[0], images[1]) # 25.57794401220945
wasserstein_distance(images[0], images[1]) # 25.76187896728515

# compute EMD using distributions
N_BINS = 256
hists = [np.histogram(img, N_BINS, density=True)[0].astype(np.float64) for img in images]

mgrid = np.meshgrid(np.arange(N_BINS), np.arange(N_BINS))
metric = np.abs(mgrid[0] - mgrid[1]).astype(np.float64)

emd(hists[0], hists[1], metric) # 25.862491463680065

Upvotes: 2

Related Questions