wake_wake
wake_wake

Reputation: 1204

R - joining more than 2^31 rows with data.table

I have an igraph network graph with 103,887 nodes and 4,795,466 ties.

This can be structured as an edgelist in a data.table with almost 9 million rows.

I can find the common neighbors in this network, following @chinsoon12's answer here. See the example below.

This works beautifully for smaller networks, but runs into problems in my use-case because the merge results in more than 2^31 rows.

Questions:

Example - modified from @chinsoon12's answer:

library(data.table)
library(igraph)

set.seed(1234)
g <- random.graph.game(10, p=0.10)

adjSM <- as(get.adjacency(g), "dgTMatrix")
adjDT <- data.table(V1=adjSM@i+1, V2=adjSM@j+1)

res <- adjDT[adjDT, nomatch=0, on="V2", allow.cartesian=TRUE
][V1 < i.V1, .(Neighbours=paste(V2, collapse=",")),
  by=c("V1","i.V1")][order(V1)]

res

   V1 i.V1 Neighbours
1:  4    5          8
2:  4   10          8
3:  5   10          8

Upvotes: 4

Views: 299

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 102329

Update

  • If you just want to query the common neighbors, I don't suggest you build up a huge look-up table. Instead, you can use the following code to get the result for your query:
find_common_neighbors <- function(g, Vs) {
  which(colSums(distances(g, Vs) == 1) == length(Vs))
}

such that

> find_common_neighbors(g, c(4, 8))
integer(0)

> find_common_neighbors(g, c(4, 5))
[1] 8
  • If you need a look-up table, an alternative is to use Neighbours as the key to search its associated node, e.g.,
res <- transform(
  data.frame(Neighbours = which(degree(g) >= 2)),
  Nodes = sapply(
    Neighbours,
    function(x) toString(neighbors(g, x))
  )
)

Previous Answer

I think you can use ego over g directly to generate res, e.g.,

setNames(
  data.frame(
    t(do.call(
      cbind,
      lapply(
        Filter(function(x) length(x) > 2, ego(g, 1)),
        function(x) {
          rbind(combn(x[-1], 2), x[1])
        }
      )
    ))
  ),
  c("V1", "V2", "Neighbours")
)

which gives

  V1 V2 Neighbours
1  4  5          8
2  4 10          8
3  5 10          8

Upvotes: 2

Frank
Frank

Reputation: 66819

common neighbors

Can I split the data and do the computation in steps?

You can split by V1 to avoid running into the big-merge issue:

neighDT = adjDT[, if (.N > 1) {
    cb = combn(V2, 2)
    .(a = cb[1, ], b = cb[2, ])
}, by=.(neighbor = V1)]

which gives

   neighbor a  b
1:        8 4  5
2:        8 4 10
3:        8 5 10

(The OP found gRbase::combnPrim to be faster than combn here.)

How can we collapse all the common neighbors (separated with a comma) for the same combination into one observation?

neighDT_agg = neighDT[order(neighbor), 
  .(neighbors = toString(neighbor))
, keyby=.(a,b)]

The order ensures that the string is sorted alphabetically. The keyby ensures that the table is sorted by pairs {a,b} and facilitates a simple fast lookup for multiple pairs at once:

# single query
neighDT_agg[.(4,10), neighbors]
# [1] "8"

# multi query
pairs_queryDT = data.table(a = c(4,5,8), b = c(5,10,10))
neighDT_agg[pairs_queryDT, neighbors]
[1] "8" "8" NA

I have an igraph network graph with 103,887 nodes and 4,795,466 ties.

Each call to combn will be making a 2-by-choose(.N, 2) matrix. If a node is connected to all other nodes, then it is a common neighbor to all pairs of other nodes and you'll be facing choose(103887-1, 2) of these pairs. I guess this is more an issue with the way the problem is defined than with the approach to solving it.


The results will be used to query about common neighbors.

For the approach above, you'll need to compute the full neighbors table first.

If you just have a few ad hoc queries about intersecting neighbors:

find_neighbors <- function(a, b){
    adjDT[.(c(a, b)), on=.(V1), V2[duplicated(V2)]]
}

find_neighbors(4, 10)
# [1] 8

This can similarly be wrapped in toString to collapse the values.

Upvotes: 2

Related Questions