How to compute cosine similarity on multi-type data?

Question

I have records (rows) in a database and I want to identify similar records. I have a constraint to use cosine similarity. If the variables (attributes, columns) vary in type and come in this form:

[number] [number] [boolean] [20 words string]

how can I proceed to the vectorization to apply the cosine similarity? For the string I can take the simple tf-idf. But for numbers and boolean values?. And how can this be combined? My thought is that the vector would be of 1+1+1+20 length. But is it semantically "efficient" to just transform the numbers of the record to coefficients in my vector and to concatenate them with the tf-idf of the string to compute the cosine similarity? Or i can treat numbers as words and apply tf-idf to numbers as well. Is there another technique?

How to compute cosine similarity on multi-type data?

Answers (1)

Related Questions