Reputation: 1554
I have records (rows) in a database and I want to identify similar records. I am constrained to using cosine similarity. The variables (attributes, columns) vary in type and come in this form:
[number] [number] [boolean] [20 words string]
How should I vectorize such a record so that I can apply cosine similarity? For the string I can use simple tf-idf, but what about the numbers and the boolean values? And how can these be combined? My thought is that the vector would be of length 1+1+1+20. But is it semantically sound to use the numbers of the record directly as coefficients in my vector and concatenate them with the tf-idf of the string to compute the cosine similarity? Or could I treat the numbers as words and apply tf-idf to them as well? Is there another technique?
Upvotes: 2
Views: 973
Reputation: 5247
Each positional element of the vectors must measure a particular attribute/feature of the entities of interest. Frequently, when words are involved, there is a vector element for the count of each word that may appear. Thus, your vector might have the size of 1 + 1 + 1 + (vocabulary size).
Because cosine similarity operates on numbers, you have to convert non-numeric values to numbers. For example, you might use 0 and 1 for booleans.
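To make the layout above concrete, here is a minimal sketch in plain Python. The sample records and the helper names (`build_vocabulary`, `vectorize`, `cosine_similarity`) are made up for illustration, and it uses raw word counts rather than tf-idf to keep it short; treat it as one possible way to do this, not the only one.

```python
import math
from collections import Counter

def build_vocabulary(texts):
    """Sorted list of every distinct lowercase word across all texts."""
    return sorted({word for text in texts for word in text.lower().split()})

def vectorize(record, vocabulary):
    """Map (number, number, boolean, text) to a list of floats.

    Layout: 1 + 1 + 1 + len(vocabulary) elements.
    """
    num_a, num_b, flag, text = record
    counts = Counter(text.lower().split())
    numeric_part = [float(num_a), float(num_b), 1.0 if flag else 0.0]
    word_part = [float(counts[word]) for word in vocabulary]
    return numeric_part + word_part

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical records in the form [number] [number] [boolean] [string].
records = [
    (3.0, 1.5, True, "red cotton shirt"),
    (3.0, 1.5, True, "red cotton shirt"),
    (0.5, 9.0, False, "blue wool hat"),
]

vocabulary = build_vocabulary(r[3] for r in records)
vectors = [vectorize(r, vocabulary) for r in records]

same = cosine_similarity(vectors[0], vectors[1])
different = cosine_similarity(vectors[0], vectors[2])
```

Here `same` compares two identical records and `different` compares records that share no words, so `same` should come out higher than `different`. Swapping the raw counts for tf-idf weights would only change how `word_part` is computed.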
You don't mention whether your numeric fields represent measurements or discrete values (e.g., keys). If the numeric values are measurements, then cosine similarity is well-suited, although if the attributes are on different scales, the ones with larger values can dominate and bias your results. However, if the numbers represent keys, then using a single vector element for each field will give poor results: a key of 5 is no closer to 6 than it is to 200, but cosine similarity doesn't know that. When a database field contains keys, you might want to have a boolean (0, 1) vector element for each possible value of that field.
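Both points can be sketched in a few lines of plain Python. Min-max scaling is just one assumed way to put measurements on a common scale (the answer does not depend on that particular choice), and the key values 5, 6 and 200 as well as the helper names are only illustrative.

```python
import math

def min_max_scale(values):
    """Rescale a column of measurements to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(key, all_keys):
    """One boolean (0/1) element per possible key value."""
    return [1.0 if key == k else 0.0 for k in all_keys]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Measurements: scale each column before concatenating it into the vector.
scaled = min_max_scale([0, 5, 10])

# Keys: every distinct key gets its own vector element.
all_keys = [5, 6, 200]
sim_5_vs_6 = cosine_similarity(one_hot(5, all_keys), one_hot(6, all_keys))
sim_5_vs_200 = cosine_similarity(one_hot(5, all_keys), one_hot(200, all_keys))
sim_5_vs_5 = cosine_similarity(one_hot(5, all_keys), one_hot(5, all_keys))
```

With the one-hot encoding, key 5 is exactly as dissimilar to 6 as it is to 200, and identical keys match perfectly, which is the behaviour you want for keys.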
Upvotes: 1