Altamash Rafiq

Reputation: 339

Jaccard Similarity between strings using a for loop in R

I am trying to compute the Jaccard similarity between each pair of names in two large vectors of names (a small example is below) and store the results in a matrix. My function just returns NULL. What am I doing wrong?

library(dplyr)

df = data.frame(matrix(NA, ncol=3, nrow=3))
df = df %>%
    mutate_if(is.logical, as.numeric)

names(df) = c("A.J. Doyle", "A.J. Graham", "A.J. Porter")
draft_names = names(df) 
row.names(df) = c("A.J. Feeley", "A.J. McCarron", "Aaron Brooks")
quarterback_names = row.names(df)

library(stringdist)

jaccard_similarity = function(d){
  for (i in 1:nrow(d)){
    for(j in 1:ncol(d)){
      d[i,j] = stringdist(quarterback_names[i], draft_names[j], method ='jaccard', q=2)
    }
  }
}

df = jaccard_similarity(df)

Upvotes: 2

Views: 1221

Answers (3)

Morse

Reputation: 9124

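Reason: there is no explicit return. The last expression in the function is the for loop, and a for loop evaluates to NULL, so that is what the function returns.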
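As a minimal base R sketch of that behaviour (the function names `fill_without_return` and `fill_with_return` are made up for illustration and are not part of the question's code):

```r
# A function returns the value of its last evaluated expression.
# A for loop evaluates to NULL (invisibly), so a function that ends
# with a loop returns NULL.
fill_without_return <- function(v) {
  for (i in seq_along(v)) {
    v[i] <- i * 2   # modifies the function's local copy only
  }
}

fill_with_return <- function(v) {
  for (i in seq_along(v)) {
    v[i] <- i * 2
  }
  v   # last expression; return(v) is equivalent here
}

x <- numeric(3)
is.null(fill_without_return(x))  # TRUE
fill_with_return(x)              # 2 4 6
x                                # still 0 0 0: the caller's vector is untouched
```

The last line also shows why the result has to be assigned: R modifies a copy inside the function, not the object that was passed in.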


You can add a print call inside the loop to trace the values while debugging, as below:

jaccard_similarity = function(d){
  for (i in 1:nrow(d)){
    for(j in 1:ncol(d)){
      d[i,j] = stringdist(quarterback_names[i], draft_names[j], method ='jaccard', q=2)
      print(d[i,j])
    }
  }
  return(d)
}

Output:

[1] 0.6428571
[1] 0.75
[1] 0.75
[1] 0.7647059
[1] 0.7777778
[1] 0.7777778
[1] 1
[1] 1
[1] 1
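To see where a value such as 0.6428571 comes from, here is a rough base R sketch of a bigram Jaccard distance. It assumes the distance is computed on the sets of distinct q-grams of each string, which reproduces the values printed above for these names; `qgram_set` and `jaccard_distance` are hypothetical helpers written for illustration, not functions from the stringdist package.

```r
# Hypothetical helper: the set of distinct q-grams (substrings of length q)
# in a string. Assumes nchar(s) >= q.
qgram_set <- function(s, q = 2) {
  starts <- seq_len(nchar(s) - q + 1)
  unique(substring(s, starts, starts + q - 1))
}

# Jaccard distance = 1 - |intersection| / |union| of the two q-gram sets.
jaccard_distance <- function(a, b, q = 2) {
  qa <- qgram_set(a, q)
  qb <- qgram_set(b, q)
  1 - length(intersect(qa, qb)) / length(union(qa, qb))
}

# "A.J. Feeley" and "A.J. Doyle" share 5 bigrams out of 14 distinct ones,
# so the distance is 1 - 5/14.
jaccard_distance("A.J. Feeley", "A.J. Doyle")
```

Note that this is a distance: 0 means identical bigram sets and 1 means no bigrams in common, which is why every "Aaron Brooks" entry is 1.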

You can simply call jaccard_similarity(df) to get the values.

output <- jaccard_similarity(df)

              A.J. Doyle A.J. Graham A.J. Porter
A.J. Feeley    0.6428571   0.7500000   0.7500000
A.J. McCarron  0.7647059   0.7777778   0.7777778
Aaron Brooks   1.0000000   1.0000000   1.0000000

Assign the output to a new variable rather than overwriting the existing df.

Upvotes: 0

James

Reputation: 66834

You are not returning anything after the for loops. Use return(d) at the end of the function.

This problem is also a classic use case for outer:

outer(quarterback_names,draft_names,FUN=stringdist,method="jaccard",q=2)
          [,1]      [,2]      [,3]
[1,] 0.6428571 0.7500000 0.7500000
[2,] 0.7647059 0.7777778 0.7777778
[3,] 1.0000000 1.0000000 1.0000000
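The matrix returned by outer has no row or column names. A small sketch of how one might label it and, since the question asks for similarity while stringdist returns a distance, convert it (the names `dist_mat` and `sim_mat` are only illustrative):

```r
library(stringdist)

quarterback_names <- c("A.J. Feeley", "A.J. McCarron", "Aaron Brooks")
draft_names <- c("A.J. Doyle", "A.J. Graham", "A.J. Porter")

# Same call as above: rows are quarterbacks, columns are draft names.
dist_mat <- outer(quarterback_names, draft_names,
                  FUN = stringdist, method = "jaccard", q = 2)

# Label rows and columns with the names they were computed from.
dimnames(dist_mat) <- list(quarterback_names, draft_names)

# Jaccard similarity is 1 minus the Jaccard distance.
sim_mat <- 1 - dist_mat
sim_mat["A.J. Feeley", "A.J. Doyle"]
```

The stringdist package also provides stringdistmatrix(quarterback_names, draft_names, method = "jaccard", q = 2), which should give the same distances without going through outer.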

Upvotes: 3

Jan

Reputation: 43169

You need to return your changed dataframe:

jaccard_similarity = function(d){
  for (i in 1:nrow(d)){
    for(j in 1:ncol(d)){
      d[i,j] = stringdist(quarterback_names[i], draft_names[j], method ='jaccard', q=2)
    }
  }
  return(d)
  # ^^^ this line was missing
}


Afterwards jaccard_similarity(df) yields

              A.J. Doyle A.J. Graham A.J. Porter
A.J. Feeley    0.6428571   0.7500000   0.7500000
A.J. McCarron  0.7647059   0.7777778   0.7777778
Aaron Brooks   1.0000000   1.0000000   1.0000000

Upvotes: 2
