Reputation: 222
Suppose I have a matrix of distance cost, in which the cost of destiny and the cost of origin both need to be below a certain threshold amount - say, US 100 -- to share a link. My difficulty lies in achieving a common set after classifying these localities: A1 links (cost of destiny and origin below threshold) with A2 and (same thing) A3 and A4; A2 links with A1 and A4; A4 links with A1 and A2. So A1, A2 and A4 would be classified in the same group, as being the group with highest frequency of links between themselves. Below I set a matrix as an example:
A1 A2 A3 A4 A5 A6 A7
A1 0 90 90 90 100 100 100
A2 80 0 90 90 90 110 100
A3 80 110 0 90 120 110 90
A4 90 90 110 0 90 100 90
A5 110 110 110 110 0 90 80
A6 120 130 135 100 90 0 90
A7 105 110 120 90 90 90 0
I am programming this with Stata and I haven't placed the matrix above in matrix form, as in mata
. The column listing the letters A plus the number is a variable with the rownames of the matrix and the rest of the columns are named with each locality name (e.g. A1 and so on).
I have returned the list of links between each locality with the following code, which maybe I did it very "bruteforcelly" since I was in a hurry:
clear all
set more off
//inputting matrix
input A1 A2 A3 A4 A5 A6 A7
0 90 90 90 100 100 100
80 0 90 90 90 100 100
80 110 0 90 120 110 90
90 90 110 0 90 100 90
110 110 110 110 0 90 90
120 130 135 100 90 0 90
105 110 120 90 90 90 0
end
//generate row variable
gen locality=""
forv i=1/7{
replace locality="A`i'" in `i'
}
*
order locality, first
//generating who gets below the threshold of 100
forv i=1/7{
gen r_`i'=0
replace r_`i'=1 if A`i'<100 & A`i'!=0
}
*
//checking if both ways (origin and destiny below threshold)
forv i=1/7{
gen check_`i'=.
forv j=1/7{
local v=r_`i'[`j']
local vv=r_`j'[`i']
replace check_`i'=`v'+`vv' in `j'
}
*
}
*
//creating list of links
gen locality_x=""
forv i=1/7{
preserve
local name = locality[`i']
keep if check_`i'==2
replace locality_x="`name'"
keep locality locality_x
save "C:\Users\user\Desktop\temp_`i'", replace
restore
}
*
use "C:\Users\user\Desktop\temp_1", clear
forv i=2/7{
append using "C:\Users\user\Desktop\temp_`i'"
}
*
//now locality_x lists if A.1 has links with A.2, A.3 etc. and so on.
//the dificulty lies in finding a common intersection between the groups.
Which returns the following listing:
locality_x locality
A1 A2
A1 A3
A1 A4
A2 A1
A2 A4
A3 A1
A4 A1
A4 A2
A4 A7
A5 A6
A5 A7
A6 A5
A6 A7
A7 A4
A7 A5
A7 A6
I am trying to get familiar with set-intersection, but I haven't a clue of how to do this in Stata. I want to do something in which I could reprogram the threshold and find the common set. I would be thankful if you could produce a solution in R, given that I can program a bit in it.
A similar way of obtaining the list in R (as @user2957945 put in his answer below):
structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L,
90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L,
90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L,
90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L,
90L, 80L, 90L, 0L), .Dim = c(7L, 7L), .Dimnames = list(c("A1",
"A2", "A3", "A4", "A5", "A6", "A7"), c("A1", "A2", "A3", "A4",
"A5", "A6", "A7")))
# get values less than threshold
id = m < 100
# make sure both values are less than threshold, and dont include diagonal
m_new = (id + t(id) == 2) & m !=0
# melt data and subset to keep TRUE values (TRUE if both less than threshold and not on diagonal)
result = subset(reshape2::melt(m_new), value)
# reorder to match question results , if needed
result[order(result[[1]], result[[2]]), 1:2]
Var1 Var2
8 A1 A2
15 A1 A3
22 A1 A4
2 A2 A1
23 A2 A4
3 A3 A1
4 A4 A1
11 A4 A2
46 A4 A7
40 A5 A6
47 A5 A7
34 A6 A5
48 A6 A7
28 A7 A4
35 A7 A5
42 A7 A6
I'm also adding the "graph theory" tag since I believe this is not exactly a intersection problem, in which I could transform the list in vectors and use the intersect
function in R. The code needs to produce a new id in which some localities must be in the same new id (group). As in the example above, if the set of A.1 has A.2 and A.4, A.2 has A.1 and A.4 and A.4 has A.1 and A.2, these three localities must be in the same id (group). In other words, I need the biggest intersection grouping of each locality. I understand that there might problems with a different matrix, such as A.1 has A.2 and A.6, A.2 has A.1 and A.6 and A.6 has A.1 and A.2 (but A.6 does not have A.4, considering the first example above still). In that situation, I welcome a solution of adding A.6 to the grouping or some other arbitrary one, in which the code just groups the first set together, removes A.1, A.2 and A.4 from the listing, and leaves A.6 with no new grouping.
Upvotes: 4
Views: 152
Reputation: 2665
Assuming what you want are the largest complete subgraphs, you can use the igraph package:
# Load necessary libraries
library(igraph)
# Define global parameters
threshold <- 100
# Compute the adjacency matrix
# (distances in both directions need to be smaller than the threshold)
am <- m < threshold & t(m) < threshold
# Make an undirected graph given the adjacency matrix
# (we set diag to FALSE so as not to draw links from a vertex to itself)
gr <- graph_from_adjacency_matrix(am, mode = "undirected", diag = FALSE)
# Find all the largest complete subgraphs
lc <- largest_cliques(gr)
# Output the list of complete subgraphs as a list of vertex names
lapply(lc, (function (e) e$name))
As far as I know, there is no similar functionality in Stata. However, if you were looking for the largest connected subgraph (which is the whole graph in your case), then you could have used clustering commands in Stata (i.e. clustermat
).
Upvotes: 1
Reputation: 2413
In R you can do
# get values less then threshold
id = m < 100
# make sure both values are less then threshold, and dont include diagonal
m_new = (id + t(id) == 2) & m !=0
# melt data and subset to keep TRUE values (TRUE if both less than threshold and not on diagonal)
result = subset(reshape2::melt(m_new), value)
# reorder to match question results , if needed
result[order(result[[1]], result[[2]]), 1:2]
Var1 Var2
8 A1 A2
15 A1 A3
22 A1 A4
2 A2 A1
23 A2 A4
3 A3 A1
4 A4 A1
11 A4 A2
46 A4 A7
40 A5 A6
47 A5 A7
34 A6 A5
48 A6 A7
28 A7 A4
35 A7 A5
42 A7 A6
.
structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L,
90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L,
90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L,
90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L,
90L, 80L, 90L, 0L), .Dim = c(7L, 7L), .Dimnames = list(c("A1",
"A2", "A3", "A4", "A5", "A6", "A7"), c("A1", "A2", "A3", "A4",
"A5", "A6", "A7")))
Upvotes: 2