Johnny521

Reputation: 61

How to measure distance when applying MDS

Hi, I have a very specific, weird question about applying MDS with Python.

When creating a distance matrix of the original high-dimensional dataset (let's call it distanceHD), you can measure the distances between all data points with either Euclidean or Manhattan distance.
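For concreteness, a tiny worked example using my data below: between rows A = (0,0,0,0) and C = (0,1,2,3), the Manhattan distance is 0+1+2+3 = 6, while the Euclidean distance is sqrt(0+1+4+9) = sqrt(14) ≈ 3.74.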

Then, after performing MDS, say I have brought my 70+ columns down to 2 columns. Now I can create a NEW distance matrix, distance2D, which measures the distances between the data points again, in either Manhattan or Euclidean distance.

Finally, I can take the difference between the two distance matrices (distanceHD and distance2D). This difference matrix shows whether the distances between data points in the large dataset were preserved in the new dataset with fewer columns (after performing MDS). I can then compute the stress from that difference matrix using the stress function, and the closer that number is to 0, the better the projection.
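For reference, the stress function I mean is (I believe) Kruskal's stress-1. In the notation above:

    stress = sqrt( sum((distanceHD - distance2D)^2) / sum(distanceHD^2) )

where the sums run over all pairs of points; the last line of code below computes exactly this.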

My question: I was originally taught to use Manhattan distance for the distanceHD matrix and Euclidean distance for the distance2D matrix. But WHY? Why not Manhattan on both? Or Euclidean on both? Or Euclidean on distanceHD and Manhattan on distance2D?

I guess the overall question along with that is: when should I use each distance metric with the MDS algorithm?

Sorry for the long and probably confusing post. I have an example displayed below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataHD = pd.DataFrame(
    [[0,0,0,0],
     [1,1,1,1],
     [0,1,2,3],
     [0,0,0,1]],
    index=['A','B','C','D'], 
    columns=['1','2','3','4'])
dataHD

import sklearn.metrics.pairwise as smp

distHD = smp.manhattan_distances(dataHD)  # L1 (Manhattan) distance function
distHD = pd.DataFrame(distHD, columns=dataHD.index, index=dataHD.index)
distHD

import sklearn.manifold

# Here we're going to find a local minimum of the stress
# The dissimilarity parameter says the input is a precomputed distance matrix

# n_init: Number of times the SMACOF algorithm will be run with different
#         initializations. The final result is the run with the lowest final stress.

# max_iter: Maximum number of iterations of the SMACOF algorithm for a single run.

mds = sklearn.manifold.MDS(dissimilarity='precomputed', n_init=10, max_iter=1000)

# NOTE: you will get different numbers every time you run this, because each run
#       can end in a different local minimum (set random_state for reproducibility)
# The key takeaway here is that the distances between data points are preserved
data2D = mds.fit_transform(distHD)

# Recall: we're using new columns that summarize the distHD table, so pick new column names
data2D = pd.DataFrame(data2D, columns=['x', 'y'], index = dataHD.index)
data2D

## Plot the MDS 2D result
%matplotlib inline
ax = data2D.plot.scatter(x='x', y='y')

# Label each data point with its row name
for label, row in data2D.iterrows():
    ax.text(row['x'], row['y'], label)

dist2D = smp.euclidean_distances(data2D)  # L2 (Euclidean) distance function
dist2D = pd.DataFrame(dist2D, columns = data2D.index, index = data2D.index)
dist2D

## Stress function (the formula given above)
np.sqrt(((distHD - dist2D) ** 2).sum().sum() / (distHD ** 2).sum().sum())
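To make the question concrete, here is a sketch of the comparison I have in mind (the loop, variable names, and random_state are my own additions, and I am assuming the same stress-1 formula as above): compute the stress for every combination of HD metric and 2D metric.

# Sketch: try all four pairings of HD metric and 2D metric and compare
# the resulting stress-1 values (reuses dataHD, smp, np from above)
metrics = {'manhattan': smp.manhattan_distances,
           'euclidean': smp.euclidean_distances}

for hd_name, hd_fn in metrics.items():
    distHD_np = hd_fn(dataHD)  # HD distance matrix as a numpy array
    mds_i = sklearn.manifold.MDS(dissimilarity='precomputed',
                                 n_init=10, max_iter=1000, random_state=0)
    emb = mds_i.fit_transform(distHD_np)  # 2D embedding of that matrix
    for d2_name, d2_fn in metrics.items():
        dist2D_np = d2_fn(emb)  # 2D distance matrix
        stress = np.sqrt(((distHD_np - dist2D_np) ** 2).sum()
                         / (distHD_np ** 2).sum())
        print(f'HD={hd_name:9} 2D={d2_name:9} stress-1={stress:.4f}')

This doesn't answer the "why" by itself, but it makes the effect of each pairing visible on the toy data.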

Upvotes: 2

Views: 932

Answers (0)
