Jonathan D. Stallings
Jonathan D. Stallings

Reputation: 33

Proximity scoring of two strings with different lengths (implementing double summation in R)

I'm trying to implement the following equation in R, and I'm having trouble doing so.

double summation equation

My current approach is the following, but I fear it is not correct because the same score is obtained with the perfect match (sorry for the lengthy, unsophisticated code: I'm very new):

query = "acut myeloid leukemia"
document1 = "acut myeloid leukemia normal karyotyp"
document2 = "acut myeloid leukemia"
document3 = "acut normal karyotyp"

Q <- unlist(strsplit(query, " "))
d1 <- unlist(strsplit(document1, " "))  
d2 <- unlist(strsplit(document2, " "))  
d3 <- unlist(strsplit(document3, " "))  

y <- adist(d1,Q)
double_summation1 = 0
for (i in 1:nrow(y-1)) {
    for (j in 1:ncol(y-1)) {
    double_summation1 = double_summation1 + abs(i-j)
    }
}
double_summation1
scatter <- sum(do.call(pmin, lapply(1:nrow(y), function(x)y[x,])))
dist_d_Q1 <- scatter/double_summation1

y <- adist(d2,Q)
double_summation2 = 0
for (i in 1:nrow(y-1)) {
    for (j in 1:ncol(y-1)) {
    double_summation2 = double_summation2 + abs(i-j)
    }
}
double_summation2
scatter <- sum(do.call(pmin, lapply(1:nrow(y), function(x)y[x,])))
dist_d_Q2 <- scatter/double_summation2

y <- adist(d3,Q)
double_summation3 = 0
for (i in 1:nrow(y-1)) {
    for (j in 1:ncol(y-1)) {
    double_summation3 = double_summation3 + abs(i-j)
    }
}
double_summation3
scatter <- sum(do.call(pmin, lapply(1:nrow(y), function(x)y[x,])))
dist_d_Q3 <- scatter/double_summation3

c(dist_d_Q1, dist_d_Q2, dist_d_Q3)

[1] 23
[1] 8
[1] 8
[1] 0.00 0.00 1.75

I realize there are easy ways to do distance measures, such as the stringdist package. But, my goal is to implement the published equation approach to have as a baseline method. Thank you for your time!

Upvotes: 1

Views: 62

Answers (1)

Sandipan Dey
Sandipan Dey

Reputation: 23101

Since you need to compute min_dist and actual_dist multiple times, you should write them as functions. Also, have your code as close as possible to the algorithm. Something like the following should work:

min_dist <- function(d, Q) {
  W <- intersect(d,Q)
  n <- length(W)
  sum(sapply(0:(n-1), function(i) sapply(0:(n-1), function(j) abs(i-j))))
}

current_dist <- function(d, Q) {
  W <- intersect(d,Q)
  pos <- sapply(W, function(x)which(Q==x))
  n <- length(pos)
  sum(sapply(1:n, function(i) sapply(1:n, function(j) abs(pos[i]-pos[j]))))
}

dist_d1_Q <- min_dist(d1, Q) / current_dist(d1, Q)
dist_d2_Q <- min_dist(d2, Q) / current_dist(d2, Q)
dist_d3_Q <- min_dist(d3, Q) / current_dist(d3, Q)

c(dist_d1_Q, dist_d2_Q, dist_d3_Q)
# [1]   1   1 NaN

Upvotes: 0

Related Questions