Reputation: 938
I have the following network in R composed of the nodes:
"39336" "19054" "32644" "52356" "14095" "18221" "12237" "61278" "34703" "15780" "33148" "54104" "5816" "92819" "4"
and the following list describes all paths that end at the node '4':
p
[[1]]
[1] 52356 61278 19054 15780 19054 61278 19054
[[2]]
[1] 15780 19054 32644 14095 12237 19054 14095
[[3]]
[1] 32644 15780 19054 32644 12237 19054
[[4]]
[1] 19054 52356
[[5]]
[1] 19054 15780 19054 52356 61278 32644 34703 39336
[[6]]
[1] 39336 61278
[[7]]
[1] 19054 52356 61278 32644 34703 61278 18221
[[8]]
[1] 32644 18221 14095 32644 15780 39336
[[9]]
[1] 33148 18221 33148 14095 32644 12237 32644 61278
[[10]]
[1] 12237 14095 52356 12237 39336 61278
[[11]]
[1] 15780 34703 15780 34703 15780 19054
[[12]]
[1] 12237 52356 61278 12237 39336 19054 61278
[[13]]
[1] 52356 54104 32644 19054 61278 19054
[[14]]
[1] 54104 39336 61278 19054 61278 32644 39336
[[15]]
[1] 5816 54104 32644 52356 19054 52356
[[16]]
[1] 5816 19054 39336
[[17]]
[1] 19054 54104 5816 19054 52356 19054
Each sub-list describes a path, starting from the first element and ending at '4'. For example, the 4th path starts at node 19054, goes to 52356, then ends up at 4.
What I want to do is capture the proportion of times a node is involved in a path starting from a given starting node.
For example, if we look at the nodes that started a path, we have:
table(rapply(p, function(x) { head(x, 1)}))
5816 12237 15780 19054 32644 33148 39336 52356 54104
2 2 2 4 2 1 1 2 1
So, for the path that was started by node 54104, I want to award all involved nodes a score of '1'. In other words, I want to derive a table like:
where I have used the notation n(i,j,X) to mean the number of paths that started at i, ended in X and involved j. I have the following attempt:
m <-matrix(0, nrow = 14, ncol= 14)
for(path in 1:length(p)){
path <- 1
verticesofPath <- as.integer(p[[path]])
for (i in 2:length(verticesofPath)){
m[verticesofPath[1], verticesofPath[i]] <- m[verticesofPath[1], verticesofPath[i]] + 1
}
}
The problem here is that the node ids are arbitrary integers (such as 52356), so I can't use them directly as row and column indices of a 15x15 matrix. How do I map the ids to the integers 1-15 so that I can keep track of which nodes took part in each path, and then map back to give the matrix rownames/colnames that are the original node ids?
Upvotes: 0
Views: 85
Reputation: 35314
I think a logical approach would be to use the id value's index into a canonical id vector as the row and column indexes of the result matrix.
The match() function is helpful here. You can match node ids into the canonical id vector to retrieve their index in that vector, which can then be used as the row or column index into the result matrix. To map backwards, you could simply index the canonical id vector with the row or column index as the vector subscript to retrieve the original node id.
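As a small illustration of that round trip, here is a sketch using three of the ids from the question. The variable names idx and back are made up for this example.

```r
## canonical id vector (a subset of the question's ids, for brevity)
ids <- c(39336L, 19054L, 32644L)

## forward: node id -> position in ids, usable as a row/column index
idx <- match(c(32644L, 39336L), ids)
idx
## [1] 3 1

## backward: position -> original node id
back <- ids[idx]
back
## [1] 32644 39336

## an id that is not in ids yields NA, which is worth checking for
match(99999L, ids)
## [1] NA
```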
Here's how this can be done:
m <- matrix(0L,length(ids),length(ids),dimnames=list(from=ids,involved=ids));
for (pi in seq_along(p)) { ## iterate over all paths; pi is the path index into p
involved <- p[[pi]][-1L]; ## get the subvector of node ids involved in (but not starting) the path
involvedUnique <- unique(involved); ## get the unique involved ids in occurrence order
involvedCount <- tabulate(match(involved,involvedUnique)); ## get their counts
ri <- match(p[[pi]][1L],ids); ## compute the implicit row index of the starting node
cis <- match(involvedUnique,ids); ## compute the implicit column indexes of the involved nodes
m[ri,cis] <- m[ri,cis]+involvedCount; ## accrue the counts onto the result matrix
}; ## end for
m;
##       involved
## from  39336 19054 32644 52356 14095 18221 12237 61278 34703 15780 33148 54104 5816 92819 4
## 39336     0     0     0     0     0     0     0     1     0     0     0     0    0     0 0
## 19054     1     3     2     4     0     1     0     3     2     1     0     1    1     0 0
## 32644     1     2     2     0     1     1     1     0     0     2     0     0    0     0 0
## 52356     0     5     1     0     0     0     0     3     0     1     0     1    0     0 0
## 14095     0     0     0     0     0     0     0     0     0     0     0     0    0     0 0
## 18221     0     0     0     0     0     0     0     0     0     0     0     0    0     0 0
## 12237     2     1     0     2     1     0     2     3     0     0     0     0    0     0 0
## 61278     0     0     0     0     0     0     0     0     0     0     0     0    0     0 0
## 34703     0     0     0     0     0     0     0     0     0     0     0     0    0     0 0
## 15780     0     3     1     0     2     0     1     0     2     2     0     0    0     0 0
## 33148     0     0     2     0     1     1     1     1     0     0     1     0    0     0 0
## 54104     2     1     1     0     0     0     0     2     0     0     0     0    0     0 0
## 5816      1     2     1     2     0     0     0     0     0     0     0     1    0     0 0
## 92819     0     0     0     0     0     0     0     0     0     0     0     0    0     0 0
## 4         0     0     0     0     0     0     0     0     0     0     0     0    0     0 0
Data
ids <- c(39336L,19054L,32644L,52356L,14095L,18221L,12237L,61278L,34703L,15780L,33148L,54104L,
5816L,92819L,4L);
p <- list(c(52356L,61278L,19054L,15780L,19054L,61278L,19054L),c(15780L,19054L,32644L,14095L,
12237L,19054L,14095L),c(32644L,15780L,19054L,32644L,12237L,19054L),c(19054L,52356L),c(19054L,
15780L,19054L,52356L,61278L,32644L,34703L,39336L),c(39336L,61278L),c(19054L,52356L,61278L,
32644L,34703L,61278L,18221L),c(32644L,18221L,14095L,32644L,15780L,39336L),c(33148L,18221L,
33148L,14095L,32644L,12237L,32644L,61278L),c(12237L,14095L,52356L,12237L,39336L,61278L),c(
15780L,34703L,15780L,34703L,15780L,19054L),c(12237L,52356L,61278L,12237L,39336L,19054L,61278L
),c(52356L,54104L,32644L,19054L,61278L,19054L),c(54104L,39336L,61278L,19054L,61278L,32644L,
39336L),c(5816L,54104L,32644L,52356L,19054L,52356L),c(5816L,19054L,39336L),c(19054L,54104L,
5816L,19054L,52356L,19054L));
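Note that the loop above counts every occurrence, so a node visited three times within one path adds 3 to its cell. If each path should instead contribute at most 1 per involved node, which is one possible reading of the question's n(i,j,X), a variant along these lines might work. This is only a sketch: the name m1 is invented here, and just two of the question's paths are used to keep it short.

```r
## two of the question's paths (numbers 4 and 11) and the ids they touch
ids <- c(19054L, 52356L, 15780L, 34703L)
p <- list(c(19054L, 52356L),
          c(15780L, 34703L, 15780L, 34703L, 15780L, 19054L))

m1 <- matrix(0L, length(ids), length(ids), dimnames = list(from = ids, involved = ids))
for (path in p) {
  ri  <- match(path[1L], ids)            ## row index of the starting node
  cis <- match(unique(path[-1L]), ids)   ## each involved node once per path
  m1[ri, cis] <- m1[ri, cis] + 1L        ## award a score of 1 per involved node
}
m1
```

With this version the second path adds 1 (rather than 2) to the 34703 and 15780 columns of row 15780.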
I should clarify that I added dimension names and labels to the result matrix purely for aesthetic purposes. It is of course possible to use dimension names as subscripts in indexing operations, and thus it would be possible to use the id values themselves as subscripts by coercing them to character values when indexing, but I recommend against this. I don't think it's a very clean approach to use as.character() coercions everywhere, and it could lead to tricky complications. For example, if you were to find yourself in a situation with pre-stringified id values that may have slightly different string representations, such as extraneous whitespace, digit group separators, or leading zeroes, then it could cause index failures.
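To make that concern concrete, here is a small sketch with made-up string variants of one id. The matrix m2 and the strings in bad are invented for illustration only.

```r
ids <- c(5816L, 4L)
m2 <- matrix(0L, 2L, 2L, dimnames = list(from = ids, involved = ids))

## a clean string form of the id works as a subscript
m2["5816", "4"] <- 1L

## variants with a leading zero or a group separator match no dimname
bad <- c("05816", "5,816")
bad %in% rownames(m2)
## [1] FALSE FALSE

## indexing with one of them raises a "subscript out of bounds" error
res <- try(m2["05816", "4"], silent = TRUE)
inherits(res, "try-error")
## [1] TRUE

## integer matching does not depend on how the id was printed
m2[match(5816L, ids), match(4L, ids)]
## [1] 1
```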
Upvotes: 1