Reputation: 47
Say I have a dataframe that looks like this:
Feature 1 Feature 2 Feature 3 Feature 4 Target
1 1 1 1 a
0 1 0 0 a
0 1 1 1 b
And a vector that looks like this:
0, 1, 1, 1
How would I find the indices of the closest matching rows to the vector? For example, if I wanted to find the 2 closest rows, I would input the vector and the dataframe (perhaps with the target column removed), and I would get indices 1 and 3 as a return from the function, since those rows most closely resemble the vector "0, 1, 1, 1".
I have tried using the "caret" package from R, with the command:
intrain <- createDataPartition(y = data$Target, p= 0.7, list = FALSE)
training <- data[intrain,]
testing <- data[-intrain,]
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(Target~., data = training, method = "knn", trControl = trctrl, preProcess = c("center", "scale"), tuneLength = 10)
test_pred <- predict(knn_fit, newdata = testing)
print(test_pred)
However, this doesn't return the index of the matching rows. It simply returns the predictions for the target that has features most closely matching the testing dataset.
I would like to find a model/command/function that behaves like KDTree from sklearn in Python, but in R instead (KDTree can return a list of the n closest indices). In addition, although not required, I would like it to work with categorical feature values (such as TRUE/FALSE) so that I don't have to create dummy variables as I've done here with my 1's and 0's.
Upvotes: 0
Views: 320
Reputation: 1378
To find the smallest distances between vectors, you can make a distance matrix:
# rows 1-3 are the data; row 4 is the query vector
mat <- matrix(c(1,1,1,1,
                0,1,0,0,
                0,1,1,1,
                0,1,1,1),
              ncol = 4, byrow = TRUE)
# Euclidean distance between every pair of rows
dist(mat, method = "euclidean")
#          1        2        3
# 2 1.732051
# 3 1.000000 1.414214
# 4 1.000000 1.414214 0.000000
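The minimum distance (0) is between rows 3 and 4, since they are identical. Row 4 is the query vector, so its line in the output gives the distance from the vector to each data row.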
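If only the distances to the query vector are needed, the full pairwise matrix can be skipped. Below is a minimal base-R sketch that returns the indices of the k closest rows; `nearest_rows` is a hypothetical helper name chosen for illustration, not a function from any package.

```r
# Data rows (features only) and the query vector from the example above.
mat <- matrix(c(1, 1, 1, 1,
                0, 1, 0, 0,
                0, 1, 1, 1),
              ncol = 4, byrow = TRUE)
vec <- c(0, 1, 1, 1)

# nearest_rows() is a hypothetical helper, not part of any package.
nearest_rows <- function(x, query, k) {
  # sweep() subtracts the query from every row; rowSums() adds up the
  # squared differences, giving one Euclidean distance per row.
  d <- sqrt(rowSums(sweep(x, 2, query)^2))
  # order() sorts ascending; ties are broken by row position.
  idx <- order(d)[seq_len(min(k, nrow(x)))]
  list(index = idx, distance = d[idx])
}

res <- nearest_rows(mat, vec, k = 2)
res$index     # 3 1
res$distance  # 0 1
```

The returned list loosely mirrors the "indices plus distances" shape of a nearest-neighbour query, but this is a brute-force scan rather than a tree search, which should be fine for small data.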
Upvotes: 0
Reputation: 3194
Agreed with 42's comment. With a simple distance metric, row 1 is exactly as far from the vector as row 2.
# your data
featureframe <- data.frame(Feature1 = c(1,0,0), Feature2 = c(1,1,1),
Feature3 = c(1,0,1), Feature4 = c(1,1,1),
Target = c("a","a","b"))
vec <- c(0,1,1,1)
distances <- apply(featureframe[,1:4], 1, function(x) sum((x - vec)^2))
distances
# [1] 1 1 0
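To turn those distances into row indices, one option is `order()`. Because rows 1 and 2 tie here, the sketch below also shows a variant that keeps every tied row; the variable names `k`, `closest` and `closest_with_ties` are illustrative only.

```r
# Same data and squared-distance calculation as above.
featureframe <- data.frame(Feature1 = c(1,0,0), Feature2 = c(1,1,1),
                           Feature3 = c(1,0,1), Feature4 = c(1,1,1),
                           Target = c("a","a","b"))
vec <- c(0,1,1,1)
distances <- apply(featureframe[, 1:4], 1, function(x) sum((x - vec)^2))

k <- 2
# order() returns row positions sorted by increasing distance;
# ties are resolved by row order, so row 1 comes before row 2.
closest <- unname(order(distances)[seq_len(k)])

# rank(ties.method = "min") gives tied rows the same rank, so every row
# tied within the top k is kept instead of being cut off arbitrarily.
closest_with_ties <- unname(which(rank(distances, ties.method = "min") <= k))

closest            # 3 1
closest_with_ties  # 1 2 3
featureframe[closest, ]  # the matching rows, including Target
```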
Edits as per comments:
To measure similarity categorically, you can instead count matching positions: the closer the sum is to the length of the vector, the more similar the two vectors are:
similarity <- apply(featureframe[,1:4], 1, function(x) sum(x == vec))
If you'd like to weight certain features more, you can multiply the match vector (x == vec) inside the function by a weight vector of equal length:
similarity <- apply(featureframe[,1:4], 1, function(x) sum((x == vec) * c(1,2,1,1)))
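The same matching idea can be applied to TRUE/FALSE columns directly, without 0/1 dummy coding. The following is a sketch under the assumption that the features are stored as logicals with the same values as `featureframe` above; `features`, `query`, `w` and `top2` are illustrative names.

```r
# Assumed: the same example rows as featureframe, stored as logicals.
features <- data.frame(Feature1 = c(TRUE, FALSE, FALSE),
                       Feature2 = c(TRUE, TRUE, TRUE),
                       Feature3 = c(TRUE, FALSE, TRUE),
                       Feature4 = c(TRUE, TRUE, TRUE))
query <- c(FALSE, TRUE, TRUE, TRUE)

# Compare each column with its query value; the result is a logical
# matrix with one row per observation and one column per feature.
matches <- mapply(function(col, q) col == q, features, query)

# Number of matching features per row (higher means more similar).
similarity <- rowSums(matches)

# Optional weights: feature 2 counts double, as in the weighted example.
w <- c(1, 2, 1, 1)
weighted <- rowSums(sweep(matches, 2, w, "*"))

# Indices of the 2 most similar rows; ties are broken by row order.
top2 <- order(similarity, decreasing = TRUE)[seq_len(2)]

similarity  # 3 3 4
weighted    # 4 4 5
top2        # 3 1
```

Comparing column by column avoids coercing the whole data frame to a single type, so the same pattern should also work when some columns are character or factor values, provided the query holds comparable values.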
Upvotes: 1