Adam Alayli

Reputation: 47

How to find the row in a dataframe that most closely resembles a given vector

Say I have a dataframe that looks like this:

Feature 1     Feature 2     Feature 3     Feature 4     Target
    1             1             1             1            a
    0             1             0             0            a 
    0             1             1             1            b

And a vector that looks like this:

0, 1, 1, 1

How would I find the indices of the closest matching rows to the vector? For example, if I wanted to find the 2 closest rows, I would input the vector and the dataframe (perhaps with the target column removed), and I would get indices 1 and 3 as a return from the function, since those rows most closely resemble the vector "0, 1, 1, 1".

I have tried using the "caret" package from R, with the command:

library(caret)

intrain <- createDataPartition(y = data$Target, p = 0.7, list = FALSE)
training <- data[intrain,]
testing <- data[-intrain,]

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(Target~., data = training, method = "knn", trControl = trctrl, preProcess = c("center", "scale"), tuneLength = 10)
test_pred <- predict(knn_fit, newdata = testing)
print(test_pred)

However, this doesn't return the indices of the matching rows. It only returns the predicted target for each row of the testing dataset, based on the training rows whose features match it most closely.

I would like to find a model/command/function in R that behaves like the KDTree class from scikit-learn in Python (KDTree can return the indices of the n closest rows). In addition, although not required, I would like it to work with categorical feature values (such as TRUE/FALSE) so that I don't have to create dummy variables like I've done here with my 1's and 0's.

Upvotes: 0

Views: 320

Answers (2)

Dij

Reputation: 1378

To find the smallest distances between vectors, you can make a distance matrix:

mat <- matrix(c(1,1,1,1,
                0,1,0,0,
                0,1,1,1,
                0,1,1,1), 
              ncol = 4, byrow = TRUE)
# the following will find the Euclidean distance between each pair of row vectors
dist(mat, method = "euclidean")
         1        2        3
2 1.732051                  
3 1.000000 1.414214         
4 1.000000 1.414214 0.000000

Clearly, the minimum here is between rows 3 and 4, since they are identical (row 4 is the vector from the question, appended to the data).
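
To get the indices the question asks for, one option is to append the vector as the last row and read off only its row of the distance matrix. This is a minimal sketch, assuming the names `rows`, `vec`, `k`, and `nearest` (mine, not from the question):

```r
# Feature rows from the question (Target column dropped)
rows <- matrix(c(1, 1, 1, 1,
                 0, 1, 0, 0,
                 0, 1, 1, 1),
               ncol = 4, byrow = TRUE)
vec <- c(0, 1, 1, 1)   # query vector
k <- 2                 # number of closest rows wanted

# Append the query as the last row and compute all pairwise distances
d <- as.matrix(dist(rbind(rows, vec), method = "euclidean"))

# Distances from the query (last row) to each of the original rows
d_to_vec <- d[nrow(d), seq_len(nrow(rows))]

# Indices of the k closest rows, nearest first
nearest <- order(d_to_vec)[seq_len(k)]
nearest
# [1] 3 1
```

Note that `dist()` computes every pairwise distance, so for a large dataframe this does far more work than needed; it is only meant to illustrate the idea.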

Upvotes: 0

Evan Friedland

Reputation: 3194

Agreed with 42's comment. With a simple distance metric, row 1 is just as far from the vector as row 2.

# your data
featureframe <- data.frame(Feature1 = c(1,0,0), Feature2 = c(1,1,1), 
                           Feature3 = c(1,0,1), Feature4 = c(1,1,1), 
                           Target = c("a","a","b"))
vec <- c(0,1,1,1)

distances <- apply(featureframe[,1:4], 1, function(x) sum((x - vec)^2))
distances
# [1] 1 1 0
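
To go from distances to row indices, `order()` can rank them. Below is a rough sketch wrapped in a helper; `closest_rows` is a hypothetical name of mine, not an existing function, and it assumes all feature columns are numeric:

```r
# Hypothetical helper: indices of the k rows of `features` closest to `vec`
closest_rows <- function(features, vec, k = 2) {
  m <- as.matrix(features)
  # Subtract vec from every row, then sum the squared differences per row
  distances <- rowSums(sweep(m, 2, vec)^2)
  # order() sorts ascending; tied rows keep their original order
  head(order(distances), k)
}

featureframe <- data.frame(Feature1 = c(1,0,0), Feature2 = c(1,1,1),
                           Feature3 = c(1,0,1), Feature4 = c(1,1,1))
closest_rows(featureframe, c(0,1,1,1), k = 2)
# [1] 3 1
```

Rows 1 and 2 tie at distance 1 here, so row 1 is returned only because it comes first.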

Edits as per comments:

To compare categorical features, you can instead count the positions that match: the closer the sum is to the length of the vector, the more similar the two vectors are:

similarity <- apply(featureframe[,1:4], 1, function(x) sum(x == vec))

If you'd like to weight certain features more, you can multiply the match vector inside the function by a weight vector of equal length:

similarity <- apply(featureframe[,1:4], 1, function(x) sum((x == vec) * c(1,2,1,1)))
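
The same matching count should also work directly on TRUE/FALSE columns, without 0/1 dummies. A small sketch, assuming a made-up logical version of the question's data (`flags`, `query`, and `top2` are names I picked for illustration):

```r
# Hypothetical logical version of the question's features
flags <- data.frame(Feature1 = c(TRUE, FALSE, FALSE),
                    Feature2 = c(TRUE, TRUE, TRUE),
                    Feature3 = c(TRUE, FALSE, TRUE),
                    Feature4 = c(TRUE, FALSE, TRUE))
query <- c(FALSE, TRUE, TRUE, TRUE)

# Count the matching positions in each row
similarity <- apply(flags, 1, function(x) sum(x == query))

# Highest similarity first; keep the indices of the top 2 rows
top2 <- head(order(similarity, decreasing = TRUE), 2)
top2
# [1] 3 1
```

Here the ranking is descending because a larger count means a closer match, the opposite of ranking by distance.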

Upvotes: 1
