Reputation: 49
I'm having trouble creating a directed graph (with the igraph package) from my dataset (data table of 10 columns) in R. The task is as follows: I need to build a directed (network) graph, where an individual X is connected to individual Y if X invited Y to the platform. Ultimately, I need to identify the size of the longest chain of the network and calculate the clustering coefficient.
After filtering my dt, dt.users consists of the following 2 columns: user_id, inviter_id.
user_id: user identification
inviter_id: id of the user that invited this user to the platform
After cleaning the data (removing all NA values), I'm trying to make this work, but I'm not sure if I'm doing it in the right way since my clustering coefficient is 0 (which seems very unlikely):
all.users <- dt.users[, list(inviter_id, user_id)]
g.invites.network <- graph.data.frame(all.users, directed = TRUE)
I've tried switching the direction of the connections, but I still get the same results in terms of diameter and clustering coefficient:
all.users <- dt.users[, list(user_id, inviter_id)]
My question is: is my directed graph wrong? If so, what am I doing wrong? I suspect my result is wrong because of the clustering coefficient of 0; it seems very unlikely to me that no clusters form at all in this network. And should I keep ...list(inviter_id, user_id) instead of ...list(user_id, inviter_id)?
Sample data (40 rows):
dt.users <- data.table::data.table(
inviter_id = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 23L, 22L, 31L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 63L, 4L, 4L, 4L),
user_id = c(17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 32L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 58L, 59L, 60L, 64L, 71L, 75L, 76L, 78L)
)
Any help would be greatly appreciated!
Upvotes: 1
Views: 1396
Reputation: 37641
At least for your sample data, 0 is the correct answer and I suspect that this will always be true for your full data set because of the way that it is constructed.
I assume that by "clustering coefficient" you mean the result of transitivity(g.invites.network), which does give zero as the answer. According to the documentation:
This is simply the ratio of the triangles and the connected triples in the graph. For directed graph the direction of the edges is ignored.
Of course, I don't know for sure how your data was constructed, but it appears that only one individual gets "credit" for inviting any other user; that is, there are never two arrows coming into a vertex. Assuming that is true, and assuming nobody can be invited by someone who joined after them (so the invitations never form a directed cycle), every connected piece of your graph is a tree and your data will never have any triangles. Therefore, the "ratio of the triangles and the connected triples in the graph" will have a numerator of zero and will always be zero.
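One way to check this assumption on your own data is to look at the in-degrees and count the triangles directly. The sketch below uses a seven-row subset of the sample data (the names edges, g, max.in.degree, n.triangles and cc are mine) and assumes an igraph version that provides graph_from_data_frame and count_triangles.

```r
library(igraph)

# Subset of the sample data: one row per invitation, inviter first
edges <- data.frame(
  inviter_id = c(4L, 4L, 4L, 23L, 22L, 31L, 63L),
  user_id    = c(22L, 23L, 24L, 25L, 26L, 32L, 71L)
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Largest number of incoming edges at any vertex
# (1 means every invited user has exactly one inviter)
max.in.degree <- max(degree(g, mode = "in"))

# Total number of triangles: count_triangles() reports, per vertex,
# how many triangles it belongs to, so each triangle is counted 3 times
n.triangles <- sum(count_triangles(g)) / 3

# Global clustering coefficient: triangles relative to connected triples
cc <- transitivity(g, type = "global")
```

If max.in.degree comes out larger than 1 on the full data set, the "one inviter per user" assumption does not hold and the clustering coefficient is no longer forced to be zero.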
This is obvious in the graph of your sample data.
plot(g.invites.network)
Addition based on comments
There are two kinds of diameter to compute - directed and undirected.
For your example data, the directed diameter is 2 and the undirected diameter is 4.
diameter(g.invites.network)
[1] 2
diameter(g.invites.network, directed=FALSE)
[1] 4
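On the question of column order: graph_from_data_frame (the current name of graph.data.frame) treats the first column as the source and the second as the target of each edge, so list(inviter_id, user_id) gives inviter -> invitee arrows. Reversing every edge leaves all path lengths unchanged, which would explain why both column orders give the same diameter, and transitivity ignores direction anyway. A minimal sketch on a subset of the sample data (all variable names are mine):

```r
library(igraph)

# Subset of the sample data
edges <- data.frame(
  inviter_id = c(4L, 4L, 4L, 23L, 22L, 31L, 63L),
  user_id    = c(22L, 23L, 24L, 25L, 26L, 32L, 71L)
)

# First column = source of the edge, second column = target
g.forward  <- graph_from_data_frame(edges[, c("inviter_id", "user_id")], directed = TRUE)
g.reversed <- graph_from_data_frame(edges[, c("user_id", "inviter_id")], directed = TRUE)

# Directed diameter: the longest chain following the arrows
d.forward  <- diameter(g.forward)    # inviter -> invitee
d.reversed <- diameter(g.reversed)   # same chains, walked backwards

# Global clustering coefficient for both orientations
cc.forward  <- transitivity(g.forward)
cc.reversed <- transitivity(g.reversed)
```

Both orientations should give identical values here, so the choice only matters for how the arrows are drawn and interpreted; inviter first matches the "X invited Y" reading in the question.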
You can get the vertices that make up these paths using get_diameter:
get_diameter(g.invites.network)
+ 3/43 vertices, named:
[1] 4 23 25
get_diameter(g.invites.network, directed=FALSE)
+ 5/43 vertices, named:
[1] 25 23 4 22 26
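If by "size of the longest chain" you mean the number of invitations rather than the number of users, it is the number of vertices on the path minus one, which should match what diameter returns. A small sketch, again on a subset of the sample data and with variable names of my own choosing:

```r
library(igraph)

# Subset of the sample data
edges <- data.frame(
  inviter_id = c(4L, 4L, 4L, 23L, 22L, 31L, 63L),
  user_id    = c(22L, 23L, 24L, 25L, 26L, 32L, 71L)
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Vertex sequence along one longest directed shortest path
path <- get_diameter(g)

# User ids on that path, as a character vector of vertex names
chain.ids <- as_ids(path)

# Number of users and number of invitations in the chain
chain.users <- length(path)
chain.edges <- length(path) - 1
```

When several chains share the maximum length, get_diameter returns only one of them, so chain.ids may differ between equally long chains.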
To subset the graph to get an idea of the diameters, you can use induced_subgraph. For example, to get just those nodes:
DiamPath = get_diameter(g.invites.network, directed=FALSE)
DiameterGraph = induced_subgraph(g.invites.network, DiamPath)
plot(DiameterGraph)
Or, if you want to look at the diameter in context, you could color the diameter vertices differently.
DiamPath = get_diameter(g.invites.network, directed=FALSE)
VC = rep("orange", vcount(g.invites.network))
VC[DiamPath] = "red"
plot(g.invites.network, vertex.color=VC)
Upvotes: 1