Reputation: 33
I'm trying to implement the following equation in R, and I'm having trouble doing so.
My current approach is the following, but I fear it is not correct because the same score is obtained with the perfect match (sorry for the lengthy, unsophisticated code: I'm very new):
query = "acut myeloid leukemia"
document1 = "acut myeloid leukemia normal karyotyp"
document2 = "acut myeloid leukemia"
document3 = "acut normal karyotyp"
Q <- unlist(strsplit(query, " "))
d1 <- unlist(strsplit(document1, " "))
d2 <- unlist(strsplit(document2, " "))
d3 <- unlist(strsplit(document3, " "))
y <- adist(d1,Q)
double_summation1 = 0
for (i in 1:nrow(y-1)) {
for (j in 1:ncol(y-1)) {
double_summation1 = double_summation1 + abs(i-j)
}
}
double_summation1
scatter <- sum(do.call(pmin, lapply(1:nrow(y), function(x)y[x,])))
dist_d_Q1 <- scatter/double_summation1
y <- adist(d2,Q)
double_summation2 = 0
for (i in 1:nrow(y-1)) {
for (j in 1:ncol(y-1)) {
double_summation2 = double_summation2 + abs(i-j)
}
}
double_summation2
scatter <- sum(do.call(pmin, lapply(1:nrow(y), function(x)y[x,])))
dist_d_Q2 <- scatter/double_summation2
y <- adist(d3,Q)
double_summation3 = 0
for (i in 1:nrow(y-1)) {
for (j in 1:ncol(y-1)) {
double_summation3 = double_summation3 + abs(i-j)
}
}
double_summation3
scatter <- sum(do.call(pmin, lapply(1:nrow(y), function(x)y[x,])))
dist_d_Q3 <- scatter/double_summation3
c(dist_d_Q1, dist_d_Q2, dist_d_Q3)
[1] 23
[1] 8
[1] 8
[1] 0.00 0.00 1.75
I realize there are easy ways to do distance measures, such as the stringdist package. But, my goal is to implement the published equation approach to have as a baseline method. Thank you for your time!
Upvotes: 1
Views: 62
Reputation: 23101
Since you need to compute min_dist
and actual_dist
multiple times, you should write them as functions. Also, have your code as close as possible to the algorithm. Something like the following should work:
min_dist <- function(d, Q) {
W <- intersect(d,Q)
n <- length(W)
sum(sapply(0:(n-1), function(i) sapply(0:(n-1), function(j) abs(i-j))))
}
current_dist <- function(d, Q) {
W <- intersect(d,Q)
pos <- sapply(W, function(x)which(Q==x))
n <- length(pos)
sum(sapply(1:n, function(i) sapply(1:n, function(j) abs(pos[i]-pos[j]))))
}
dist_d1_Q <- min_dist(d1, Q) / current_dist(d1, Q)
dist_d2_Q <- min_dist(d2, Q) / current_dist(d2, Q)
dist_d3_Q <- min_dist(d3, Q) / current_dist(d3, Q)
c(dist_d1_Q, dist_d2_Q, dist_d3_Q)
# [1] 1 1 NaN
Upvotes: 0