Reputation: 2645
I have a data frame that records which cities each user has visited:
df.visited <- data.frame(user = c("john","john",
"claire", "claire",
"doe","doe"),
city = c('Antananarivo', 'Barcelona',
'Caen', 'Dijon',
'Antananarivo', 'Caen'))
I want to create a graph of co-visits. For that, I need either an adjacency matrix (users x users) or an edge list (usera, userb, number of co-visits).
I can do this for small datasets:
by_user_city <- table(df.visited)
#         city
# user     Antananarivo Barcelona Caen Dijon
#   claire            0         0    1     1
#   doe               1         0    1     0
#   john              1         1    0     0
adjacency <- by_user_city %*% t(by_user_city)
#         user
# user     claire doe john
#   claire      2   1    0
#   doe         1   2    1
#   john        0   1    2
library(reshape2)  # provides melt()
edges <- melt(adjacency)
#     user   user value
# 1 claire claire     2
# 2    doe claire     1
# 3   john claire     0
# 4 claire    doe     1
# 5    doe    doe     2
# 6   john    doe     1
# 7 claire   john     0
# 8    doe   john     1
# 9   john   john     2
For a large dataset with a log of 1.5M visits of more than 300,000 users, the table command complains:
Error in table(df.visited) :
attempt to make a table with >= 2^31 elements
So, how can I get the co-visit edges without running out of memory?
Upvotes: 3
Views: 341
Reputation: 54237
Speaking of igraph, maybe try:
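library(igraph)
g <- graph_from_data_frame(df.visited)
# Mark the two vertex sets (users vs. cities) of the bipartite graph
V(g)$type <- bipartite_mapping(g)$type
# Project onto the users; the edge weight counts the shared cities
g2 <- bipartite_projection(g)$proj1
as_data_frame(g2, "edges") %>% head
#     from  to weight
# 1   john doe      1
# 2 claire doe      1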
g2 is the graph that you are probably looking for.
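If you need the adjacency matrix as well as the edge list, here is a minimal sketch along the same lines. It assumes that no user shares a name with a city, so the vertex types can be set explicitly instead of being inferred; the name g_users is only illustrative.

```r
library(igraph)
library(Matrix)

df.visited <- data.frame(
  user = c("john", "john", "claire", "claire", "doe", "doe"),
  city = c("Antananarivo", "Barcelona", "Caen", "Dijon", "Antananarivo", "Caen")
)

# Undirected bipartite graph: one edge per distinct (user, city) visit
g <- graph_from_data_frame(unique(df.visited), directed = FALSE)

# Assumption: no user shares a name with a city, so the type can be set
# directly (FALSE = user, TRUE = city)
V(g)$type <- V(g)$name %in% df.visited$city

# Project onto the users only; the 'weight' attribute counts shared cities
g_users <- bipartite_projection(g, which = "false")

# Edge list (from, to, weight)
edges <- igraph::as_data_frame(g_users, what = "edges")

# Sparse users x users adjacency matrix carrying the co-visit counts
adjacency <- as_adjacency_matrix(g_users, attr = "weight", sparse = TRUE)
```

Unlike the matrix product in the question, the diagonal here is empty, because the projection has no self-loops.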
Upvotes: 2
Reputation: 23200
Given the size of your data, I suggest using the Java-based graph database Neo4j. Former Neo4j employee Nicole White made an R package for it, RNeo4j. I did this in 2014 to set up a lot of real-time analytics on a very large company social network, and it worked quite well.
You might also be able to make it work with some other graph database, but this is the one I know of and I think it's probably the most popular.
Here are the steps, as I see them:
install.packages("RNeo4j")
library(RNeo4j)
# Connect to a Neo4j server running locally
graph = startGraph("http://localhost:7474/db/data/")
If you want more clarity on #4 and #5, there's an old post where someone asked how to scale up loading data into Neo4j with R, and White answered with examples of how to use the transactional endpoint and query the results. Of course, you could also load it outside of R if you wanted to.
This also solves many of the future problems you may have: how to visualize the social graph, how to run all sorts of different queries against your network/forum, how to deal with increasing size, and so on. You shouldn't run into memory problems this way, as it's really well designed for scale.
You can use graphing packages like igraph and ggnet with it, keeping the memory-intensive parts in the graph database:
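library(igraph)
# Pull an edge list out of Neo4j with a Cypher query
query = "
MATCH (n)-->(m)
RETURN n.name, m.name
"
edgelist = cypher(graph, query)
# Build an undirected igraph object from the returned data frame
ig = graph_from_data_frame(edgelist, directed = FALSE)
betweenness(ig)
plot(ig)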
Upvotes: 3
Reputation: 4554
Try this to avoid the table function:
library(tidyr)
df.visited$val<-1
spread(df.visited,city,val,fill=0)
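Note that the wide table is still dense (one cell per user/city combination). As a possible alternative, here is a minimal sketch that swaps in the Matrix package and stores only the visited pairs. The names incidence, usera, userb and covisits are illustrative. The result can still grow large if a few cities are shared by very many users.

```r
library(Matrix)

df.visited <- data.frame(
  user = c("john", "john", "claire", "claire", "doe", "doe"),
  city = c("Antananarivo", "Barcelona", "Caen", "Dijon", "Antananarivo", "Caen")
)

# Work on distinct (user, city) pairs so repeat visits are not double counted
visits <- unique(df.visited)
users  <- factor(visits$user)
cities <- factor(visits$city)

# Sparse users x cities incidence matrix: only visited pairs are stored
incidence <- sparseMatrix(
  i = as.integer(users),
  j = as.integer(cities),
  x = 1,
  dimnames = list(levels(users), levels(cities))
)

# Sparse equivalent of by_user_city %*% t(by_user_city)
adjacency <- tcrossprod(incidence)

# Keep each unordered pair of users once (strict upper triangle)
s <- summary(triu(adjacency, k = 1))
edges <- data.frame(
  usera    = levels(users)[s$i],
  userb    = levels(users)[s$j],
  covisits = s$x
)
```

The diagonal of adjacency holds the number of distinct cities per user, as in the question; the edge list drops it together with the mirrored lower triangle.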
Upvotes: 2