ladybug_4

Reputation: 15

Creating graph with iGraph and computing similarity metric

I am trying to create a graph in igraph using a csv file that looks like this:

ID    Element1 Element2 Element3 Element4
12346  A        12       56       2
13007  Y        16       66       2
...    ...      ...      ...      ...

The ID column holds unique numeric identifiers, whereas the element columns hold numbers (or letters, in Element1) that repeat across rows. My goal is to compute the pairwise Jaccard similarity of all the IDs, based on the elements that each pair of IDs shares. The output should be an N×N matrix.

I have been trying to create the graph with igraph's graph_from_data_frame function, but it builds the nodes from the first two columns only and stores the remaining columns as attributes on the edges it creates. Any ideas on the best way to create a graph that will allow me to compute the Jaccard similarity between the ID nodes?

For reference, the goal is to use this igraph function:

 similarity(graph, vids = V(graph), mode = c("all", "out", "in", "total"),
 loops = FALSE, method = c("jaccard", "dice", "invlogweighted"))

where graph is the graph I create and vids are only the ID nodes.

Upvotes: 1

Views: 584

Answers (1)

Lamia

Reputation: 3875

If your primary goal is to compute the Jaccard similarity between your IDs, you can do it without creating a graph first. Treat each ID's row of element values as a set. The Jaccard similarity of two sets A and B is the size of their intersection divided by the size of their union, J(A,B) = |A⋂B| / |A⋃B|, which is equivalent to |A⋂B| / ( |A| + |B| - |A⋂B| ). So you can calculate it as follows:

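## Example dataset
df = read.table(text = "ID    Element1 Element2 Element3 Element4
  12346  A        12       56       2
  13007  Y        16       66       2
  14008  B        14       56       3
  15078  A        15       56       4
  16000  Y        20       66       3",
header = TRUE, stringsAsFactors = FALSE)

n = nrow(df)        # number of IDs
k = ncol(df) - 1    # number of element columns, i.e. |A| = |B| = k
m = matrix(0, n, n, dimnames = list(df[, 1], df[, 1]))
for (i in 1:n) {
    for (j in i:n) {
        ## Shared values of rows i and j (compared as character vectors)
        shared = length(intersect(unlist(df[i, -1]), unlist(df[j, -1])))
        ## |A⋂B| / (|A| + |B| - |A⋂B|); valid when values within a row are distinct
        m[i, j] = shared / (2 * k - shared)
    }
}
## Making the matrix symmetrical
m[lower.tri(m)] = t(m)[lower.tri(m)]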
> m
          12346     13007     14008     15078     16000
12346 1.0000000 0.1428571 0.1428571 0.3333333 0.0000000
13007 0.1428571 1.0000000 0.0000000 0.0000000 0.3333333
14008 0.1428571 0.0000000 1.0000000 0.1428571 0.1428571
15078 0.3333333 0.0000000 0.1428571 1.0000000 0.0000000
16000 0.0000000 0.3333333 0.1428571 0.0000000 1.0000000
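
If you do want to go through igraph's similarity(), one option is a bipartite-style graph in which every ID and every distinct element value is a vertex, and each row contributes one edge per element. The sketch below rebuilds the same example data and treats a value as the same element regardless of which column it appears in. The "val_" prefix and the names edges, ids and s are my own choices for illustration, not anything igraph requires, and I have not checked it against your real data.

```r
library(igraph)

## Same example dataset as in the question
df <- read.table(text = "ID    Element1 Element2 Element3 Element4
  12346  A        12       56       2
  13007  Y        16       66       2
  14008  B        14       56       3
  15078  A        15       56       4
  16000  Y        20       66       3",
header = TRUE, stringsAsFactors = FALSE)

ids <- as.character(df$ID)

## Long edge list: one (ID, value) pair per cell of the element columns.
## unlist() walks the data frame column by column, so the ID vector is
## repeated once per element column to stay aligned with the values.
## The "val_" prefix (illustrative) keeps value vertices from colliding
## with ID vertices that happen to have the same label.
edges <- data.frame(
  from = rep(ids, times = ncol(df) - 1),
  to   = paste0("val_", unlist(df[, -1], use.names = FALSE)),
  stringsAsFactors = FALSE
)

## unique() drops repeated (ID, value) pairs so there are no multi-edges
g <- graph_from_data_frame(unique(edges), directed = FALSE)

## Jaccard similarity restricted to the ID vertices
s <- similarity(g, vids = match(ids, V(g)$name), method = "jaccard")
dimnames(s) <- list(ids, ids)
s
```

For the example data this should give the same values as the matrix above, since the neighbours of each ID vertex are exactly its element values.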

Upvotes: 1
