Reputation: 15
I am trying to create a graph in igraph using a csv file that looks like this:
ID Element1 Element2 Element3 Element4
12346 A 12 56 2
13007 Y 16 66 2
... ... ... ... ...
The ID column holds unique numeric identifiers, whereas the element columns hold values (numbers, or letters in Element1) that repeat across rows. My goal is to compute the pairwise Jaccard similarity between all IDs, based on the elements they share. The output should be an N x N matrix.
I have been trying to create the graph in igraph with graph_from_data_frame, but it creates nodes from the first two columns only and stores the remaining columns as edge attributes on the edges between those nodes. Any ideas on the best way to build a graph that lets me compute the Jaccard similarity between the ID nodes?
For reference, the goal is to use this igraph function:
similarity(graph, vids = V(graph), mode = c("all", "out", "in", "total"),
loops = FALSE, method = c("jaccard", "dice", "invlogweighted"))
where graph is the graph I create and vids are only the ID nodes.
Upvotes: 1
Views: 584
Reputation: 3875
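If your primary goal is to compute the Jaccard similarity between your IDs, you can do it without creating a graph first. The Jaccard similarity is defined as the size of the intersection divided by the size of the union: J(A,B) = |A∩B| / |A∪B|, which is equivalent to |A∩B| / ( |A| + |B| - |A∩B| ). Here A and B are the sets of element values of two IDs, so |A| and |B| both equal the number of element columns as long as the values within a row are distinct. So you can calculate it as follows: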
## Example dataset
df = read.table(text = "ID Element1 Element2 Element3 Element4
12346 A 12 56 2
13007 Y 16 66 2
14008 B 14 56 3
15078 A 15 56 4
16000 Y 20 66 3"
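,header=TRUE,stringsAsFactors=FALSE)
n = nrow(df)
## Number of element columns, i.e. |A| = |B| = k when the values in a row are distinct
k = ncol(df)-1
m = matrix(0,n,n,dimnames=list(df[,1],df[,1]))
## Fill the upper triangle (including the diagonal)
for(i in 1:n){
  for(j in i:n){
    ## unlist() turns each one-row data frame into a plain character vector
    shared = length(intersect(unlist(df[i,-1]),unlist(df[j,-1])))
    m[i,j] = shared/(2*k-shared)}}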
## Making the matrix symmetrical
m[lower.tri(m)] = t(m)[lower.tri(m)]
> m
12346 13007 14008 15078 16000
12346 1.0000000 0.1428571 0.1428571 0.3333333 0.0000000
13007 0.1428571 1.0000000 0.0000000 0.0000000 0.3333333
14008 0.1428571 0.0000000 1.0000000 0.1428571 0.1428571
15078 0.3333333 0.0000000 0.1428571 1.0000000 0.0000000
16000 0.0000000 0.3333333 0.1428571 0.0000000 1.0000000
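If you do still want to go through igraph's similarity(), one possible route is to reshape the table into (ID, element) pairs and build an undirected graph in which every ID node is linked to its element nodes. The Jaccard similarity of two ID nodes is then computed on their neighbour sets, which are exactly their element sets. This is only a minimal sketch: the names edges, ids and id_idx and the "el_" prefix are my own choices, and like the loop above it treats equal values in different columns as the same element.

```r
library(igraph)

## Same example data as above
df <- read.table(text = "ID Element1 Element2 Element3 Element4
12346 A 12 56 2
13007 Y 16 66 2
14008 B 14 56 3
15078 A 15 56 4
16000 Y 20 66 3", header = TRUE, stringsAsFactors = FALSE)

ids <- as.character(df$ID)

## Long format: one (ID, element) pair per row.
## unlist() walks the element columns one after another,
## so the IDs are repeated once per element column.
## The "el_" prefix (an arbitrary choice) keeps element values
## from clashing with ID names.
edges <- unique(data.frame(
  from = rep(ids, times = ncol(df) - 1),
  to   = paste0("el_", unlist(df[-1])),
  stringsAsFactors = FALSE
))

## Undirected graph: ID nodes on one side, element nodes on the other
g <- graph_from_data_frame(edges, directed = FALSE)

## Jaccard similarity restricted to the ID nodes
id_idx <- match(ids, V(g)$name)
jac <- similarity(g, vids = id_idx, method = "jaccard")
dimnames(jac) <- list(ids, ids)
round(jac, 4)
```

On the example data the off-diagonal values should agree with the matrix m above, for instance 1/7 for 12346 vs 13007 and 1/3 for 12346 vs 15078.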
Upvotes: 1