Reputation: 85
I have a set of vectors, each of which contains both textual and numeric elements. I am looking for similarity measures for such vectors and, if possible, frameworks that implement them. Any help is much appreciated.
Upvotes: 4
Views: 4535
Reputation: 3326
A nice metric for textual data is the Levenshtein distance (or edit distance), which counts how many edits are needed to change one string into the other. A less computationally intensive option is the Hamming distance, which provides a similar metric but requires the strings to have the same length. Converting letters to their ASCII representation is unlikely to give relevant results (though it depends on your application and how you use the distance): is "Z" closer to "S" or to "A"?
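If you want to experiment with these string metrics, here is a minimal Python sketch of both (the function names are mine, not from any particular library):

def levenshtein(a, b):
    # classic dynamic-programming edit distance:
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def hamming(a, b):
    # only defined for strings of equal length
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))

print(levenshtein("kitten", "sitting"))  # 3
print(hamming("karolin", "kathrin"))     # 3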
Combined with a Euclidean distance for your numeric data (if you expect it to lie in the Euclidean plane... this might not be the case if it represents coordinates on Earth, angles, etc.), you can sum and weight the squared distances to obtain a final metric. For instance, you would get d(A,B) = sqrt( weight1*Levenshtein(textA, textB)^2 + weight2*Euclidean(numericA, numericB)^2 )
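A minimal sketch of such a combined metric, reusing the levenshtein function sketched above and Python's math.dist for the Euclidean part; the default weights of 1.0 are placeholders you would tune for your domain:

import math

def combined_distance(text_a, text_b, num_a, num_b, w_text=1.0, w_num=1.0):
    # weighted combination of an edit distance on the textual part
    # and a Euclidean distance on the numeric part
    d_text = levenshtein(text_a, text_b)
    d_num = math.dist(num_a, num_b)   # Euclidean distance (Python 3.8+)
    return math.sqrt(w_text * d_text**2 + w_num * d_num**2)

print(combined_distance("apple", "apples", (1.0, 2.0), (1.5, 2.5)))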
Now the question is how to set such weights. For instance, if you are measuring small numeric distances in kilometers and computing edit distances between very long strings, the numeric data will be almost irrelevant, so you would need to weight it more heavily. This is domain-specific, and only you can choose such weights, depending on your data and your application.
In the end, everything depends on your application, which you did not specify, and on your data, which you did not describe. One application could be building an acceleration structure; in that case, any not-too-unreasonable metric could work (including converting letters to ASCII numbers). Another could be querying a database or displaying these points, for which the choice of metric matters more. As for your data, the numeric part could represent coordinates on a plane or on the Earth (which would change the metric), and the textual part could be a single letter whose sound you want to compare with another, or a full text that may differ from another text by a few letters... Without more detail, it is hard to tell.
Upvotes: 0
Reputation: 70028
To me this is a data modeling problem rather than one of finding an appropriate similarity metric.
For instance, you can use Euclidean distance provided that you:
re-scale your data (e.g., mean-centered & unit variance); and
re-code the "textual" elements (by which I assume you mean discrete variables, such as a field storing gender with values of male and female).
So, for instance, imagine a dataset composed of data vectors, each with four features (columns or fields):
minutes_per_session, sessions_per_week, registered_user, sex
The first two are continuous (aka "numeric") variables--i.e., proper values are 12.5, 4.7, and so on.
The last two are discrete and obviously require transformation.
The common technique is to re-code each discrete feature into a set of features, one feature for each value recorded for that feature (and each new feature is given the name of a value of the original feature).
Hence a single column storing the sex of each user, with values of M and F, would be transformed into two features (fields or columns), because sex has two possible values.
So the column of values for user sex:
['M']
['M']
['F']
['M']
['M']
['F']
['F']
['M']
['M']
['M']
becomes two columns
[1, 0]
[1, 0]
[0, 1]
[1, 0]
[1, 0]
[0, 1]
[0, 1]
[1, 0]
[1, 0]
[1, 0]
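As a rough illustration, that re-coding could be done in NumPy along these lines (the column order ['M', 'F'] is chosen only to match the example above):

import numpy as np

sex = ['M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M']

# one new column per distinct value: 1 where the row has that value, 0 otherwise
values = ['M', 'F']   # column order chosen to match the example above
one_hot = np.array([[1 if s == v else 0 for v in values] for s in sex])
print(one_hot)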
A randomly generated 2D array of synthetic data:
array([[ 3., 5., 2., 4.],
[ 9., 2., 0., 8.],
[ 5., 1., 8., 0.],
[ 9., 9., 7., 4.],
[ 3., 1., 6., 2.]])
For each column: calculate the mean,
then subtract that mean from each value in the column:
>>> A -= A.mean(axis=0)
>>> A
array([[-2.8, 1.4, -2.6, 0.4],
[ 3.2, -1.6, -4.6, 4.4],
[-0.8, -2.6, 3.4, -3.6],
[ 3.2, 5.4, 2.4, 0.4],
[-2.8, -2.6, 1.4, -1.6]])
For each column: now calculate the *standard deviation*,
then divide each value in that column by this std:
>>> A /= A.std(axis=0)
Verify:
>>> A.mean(axis=0)
array([ 0., -0., 0., -0.])
>>> A.std(axis=0)
array([ 1., 1., 1., 1.])
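For reference, if scikit-learn is available, its StandardScaler does the same re-scaling in one step; a minimal sketch:

import numpy as np
from sklearn.preprocessing import StandardScaler

A = np.array([[3., 5., 2., 4.],
              [9., 2., 0., 8.],
              [5., 1., 8., 0.],
              [9., 9., 7., 4.],
              [3., 1., 6., 2.]])

# fit_transform mean-centers each column and divides by its (population) std
A_scaled = StandardScaler().fit_transform(A)
print(A_scaled.mean(axis=0))   # ~0 for every column
print(A_scaled.std(axis=0))    # ~1 for every column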
So the original array of four columns now has six (the two continuous columns plus two columns for each of the two re-coded discrete features); pair-wise similarity can be measured by Euclidean distance, like so:
Take the first two data vectors (rows):
>>> v1, v2 = A1[:2,:]
Euclidean distance, for a 2-feature space:
dist = ( (x2 - x1)**2 + (y2 - y1)**2 )**0.5
>>> sm = NP.sum((v2 - v1)**2)**.5
>>> sm
3.79
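To tie the steps together, here is a small end-to-end sketch of the whole approach on made-up values for the four example features (re-code the discrete columns, re-scale the continuous ones, then take the Euclidean distance between the first two rows):

import numpy as np

# continuous features: minutes_per_session, sessions_per_week (made-up values)
continuous = np.array([[12.5, 4.7],
                       [30.0, 2.0],
                       [ 8.2, 6.1]])

# discrete features, re-coded into one column per value
registered_user = np.array([[1, 0],    # yes
                            [0, 1],    # no
                            [1, 0]])   # yes
sex = np.array([[1, 0],    # M
                [0, 1],    # F
                [0, 1]])   # F

# re-scale the continuous columns: mean-centered, unit variance
continuous = (continuous - continuous.mean(axis=0)) / continuous.std(axis=0)

# the four original features become six columns
X = np.hstack([continuous, registered_user, sex])

# Euclidean distance between the first two data vectors
v1, v2 = X[:2, :]
print(np.sum((v2 - v1)**2)**0.5)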
Upvotes: 3