Reputation: 2645

Co-occurrences from a large dataframe

I have a dataframe recording which cities each user has visited:

df.visited <- data.frame(user  = c("john","john", 
                                   "claire", "claire", 
                                    "doe","doe"), 
                        city = c('Antananarivo', 'Barcelona', 
                                 'Caen', 'Dijon', 
                                 'Antananarivo', 'Caen'))

I want to create a graph of co-visits. For that, I need either an adjacency matrix (users x users) or an edge list (usera, userb, #co-visits).

I can do this for small datasets:

by_user_city <- table(df.visited)    

#        city
#user     Antananarivo Barcelona Caen Dijon
#claire            0         0    1     1
#doe               1         0    1     0
#john              1         1    0     0

adjacency <- by_user_city %*% t(by_user_city)

#     user 
#user     claire doe john
#claire      2   1    0
#doe         1   2    1
#john        0   1    2

library(reshape2)  # provides melt()
edges <- melt(adjacency)

#    user   user value
#1 claire claire     2
#2    doe claire     1
#3   john claire     0
#4 claire    doe     1
#5    doe    doe     2
#6   john    doe     1
#7 claire   john     0
#8    doe   john     1
#9   john   john     2
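
The melted result lists every pair twice and includes the self-pairs and zero counts. A minimal base-R sketch of an edge list that keeps each unordered pair once (the column names usera, userb and covisits are just illustrative) could be:

```r
df.visited <- data.frame(
  user = c("john", "john", "claire", "claire", "doe", "doe"),
  city = c("Antananarivo", "Barcelona", "Caen", "Dijon", "Antananarivo", "Caen")
)

by_user_city <- table(df.visited)
adjacency <- by_user_city %*% t(by_user_city)

# Keep only the strict upper triangle (each pair once, no self-pairs)
# and only pairs that share at least one city
keep <- upper.tri(adjacency) & adjacency > 0
idx <- which(keep, arr.ind = TRUE)

edges <- data.frame(
  usera    = rownames(adjacency)[idx[, "row"]],
  userb    = colnames(adjacency)[idx[, "col"]],
  covisits = adjacency[idx]
)
```

This still goes through table(), so it only works for the small case.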

For a large dataset with a log of 1.5M visits by more than 300,000 users, the table command complains:

Error in table(df.visited) : 
  attempt to make a table with >= 2^31 elements
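
A rough back-of-the-envelope check of why this happens: table() allocates one cell per (user, city) combination. The number of distinct cities is not stated here, so the figures below only illustrate where the limit sits for 300,000 users:

```r
n_users  <- 300000   # lower bound taken from the description above
n_visits <- 1.5e6    # rows in the visit log
limit    <- 2^31     # element limit reported in the error message

# Smallest number of distinct cities for which users x cities reaches the limit
min_cities <- ceiling(limit / n_users)

# At that size, at most this share of the cells could be non-zero
max_fill <- n_visits / (n_users * min_cities)
```

Under these assumptions the dense table would be almost entirely zeros, which is why a sparse representation is the natural thing to look for.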

So, how can I get the co-visit edges without running out of memory?

Upvotes: 3

Answers (3)

lukeA

Reputation: 54237

Speaking of igraph - maybe try:

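library(igraph)
# Bipartite graph with one edge per (user, city) row
g <- graph_from_data_frame(df.visited)
# Mark the two vertex sets (users vs. cities)
V(g)$type <- bipartite_mapping(g)$type
# Keep the projection onto the first vertex set (the users here)
g2 <- bipartite_projection(g)$proj1
head(as_data_frame(g2, "edges"))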
#     from  to weight
# 1   john doe      1
# 2 claire doe      1

g2 is the graph that you are probably (?) looking for.
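
A self-contained version of the same idea on the toy data from the question might look like the sketch below. It sets the vertex types explicitly instead of inferring them, which assumes that no user name is identical to a city name, and it de-duplicates the log first so that repeated visits to the same city are counted once:

```r
library(igraph)

df.visited <- data.frame(
  user = c("john", "john", "claire", "claire", "doe", "doe"),
  city = c("Antananarivo", "Barcelona", "Caen", "Dijon", "Antananarivo", "Caen")
)

# Undirected bipartite graph: one edge per distinct (user, city) visit
g <- graph_from_data_frame(unique(df.visited), directed = FALSE)

# Assumption: user names and city names never collide
V(g)$type <- V(g)$name %in% df.visited$city   # FALSE = user, TRUE = city

# Project onto the users only (the vertices whose type is FALSE);
# the edge attribute `weight` holds the number of shared cities
g_users <- bipartite_projection(g, which = "false")

edges <- igraph::as_data_frame(g_users, what = "edges")
```

The projection only stores pairs of users that share at least one city, so the users x cities table is never built. The result can still be large if a few cities are visited by very many users.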

Upvotes: 2

Hack-R

Reputation: 23200

Given the size of your data, I suggest using the Java graph database neo4j. Nicole White, a former neo4j employee, made an R package for it, RNeo4j. I used it in 2014 to set up a lot of real-time analytics on a very large company social network and it worked quite well.

You might also be able to make it work with some other graph database, but this is the one I know of and I think it's probably the most popular.

Here are the steps, as I see them:

1. Download Neo4j
2. install.packages("RNeo4j")
3. Connect: graph = startGraph("http://localhost:7474/db/data/")
4. Use the transactional endpoint to load the data
5. Query the results using Cypher

If you want more clarity on #4 and #5, there's an old post in which someone asked how to scale up loading data into neo4j with R; White answered it with examples of how to use the transactional endpoint and query the results. Of course, you could also load the data outside of R if you wanted to.

This also solves many of the future problems you may have: how to visualize the social graph, how to run all sorts of different queries on your network/forum, how to deal with increasing size, etc. You shouldn't run into memory problems this way, as it's really well-designed for scale.

You can use graphing packages like igraph and ggnet with it, keeping the memory-intensive parts in the graph database:

library(RNeo4j)  # provides startGraph() and cypher()
library(igraph)

query = "
MATCH (n)-->(m)
RETURN n.name, m.name
"

edgelist = cypher(graph, query)
ig = graph_from_data_frame(edgelist, directed = FALSE)

betweenness(ig)

plot(ig)

Upvotes: 3

Shenglin Chen

Reputation: 4554

Try this to avoid the table function:

library(tidyr)
df.visited$val<-1
spread(df.visited,city,val,fill=0)
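
Note that spread() still creates one column per city, so the result is a dense users x cities data frame. A sparse variant of the matrix product from the question, sketched here with the Matrix package (the column names usera, userb and covisits are illustrative, and the log is de-duplicated so each city counts once per user), might look like:

```r
library(Matrix)

df.visited <- data.frame(
  user = c("john", "john", "claire", "claire", "doe", "doe"),
  city = c("Antananarivo", "Barcelona", "Caen", "Dijon", "Antananarivo", "Caen")
)

visits <- unique(df.visited)
users  <- factor(visits$user)
cities <- factor(visits$city)

# Sparse users x cities incidence matrix: only the visited cells are stored
m <- sparseMatrix(
  i = as.integer(users), j = as.integer(cities), x = 1,
  dims = c(nlevels(users), nlevels(cities))
)

# Sparse users x users co-visit counts, equivalent to m %*% t(m)
adj <- tcrossprod(m)

# Triplet (i, j, x) view of the stored entries; drop the diagonal and
# normalise each pair so it is reported once, whichever triangle is stored
s  <- summary(adj)
s  <- s[s$i != s$j, ]
lo <- pmin(s$i, s$j)
hi <- pmax(s$i, s$j)

edges <- unique(data.frame(
  usera    = levels(users)[lo],
  userb    = levels(users)[hi],
  covisits = s$x
))
```

Memory use then scales with the number of user pairs that actually share a city rather than with users x cities, though a few very popular cities can still make the users x users result large.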

Upvotes: 2
