Reputation: 75
I have a data frame containing pairs of interacting elements. I want to filter the pairs so that each element occurs in only one pair. Which pair is kept for each element should be randomly selected.
df <- data.frame(element1=c("A","A","B","B","B","C","C","D"),
element2=c("B","C","D","E","C","D","E","F"),
stringsAsFactors = FALSE)
Example input:
Element1 Element2
A B
A C
B D
B E
B C
C D
C E
D F
Desired output e.g.:
Element1 Element2
A B
C E
D F
Thank you for your help.
Upvotes: 1
Views: 145
Reputation: 25225
For this toy example, you can approach it as an optimization problem with binary variables representing which rows are chosen, and constraints such that each variable is chosen once.
el <- sort(unique(c(df$element1, df$element2)))
A <- matrix(0L, nrow=length(el), ncol=nrow(df), dimnames=list(el, sample(nrow(df))))
i <- match(unlist(df), rownames(A))
j <- match(rep(1L:nrow(df), ncol(df)), colnames(A))
A[cbind(i, j)] <- 1L
ans <- lp(direction="min",
objective.in=rep(1L, nrow(df)),
const.mat=A,
const.dir="==",
const.rhs=rep(1L, length(el)),
all.bin=TRUE)
df[colnames(A)[as.logical(ans$solution)], ]
output:
element1 element2
2 A C
4 B E
8 D F
Upvotes: 0
Reputation: 35382
A simple igraph
method, that isn't smart about keeping the most data. But it will guarantee a maximum of one connection per element.
library(igraph)
gr <- graph_from_data_frame(df, directed = FALSE)
for (vertex in sample(V(gr))) {
if (degree(gr)[vertex] > 1) {
edges <- E(gr)[.inc(vertex)]
to_remove <- sample(edges, length(edges) - 1)
gr <- delete_edges(gr, to_remove)
}
}
as_data_frame(gr)
You can maybe do a bit better when ordering the vertices beforehand.
Upvotes: 3
Reputation: 324
I am not sure if this exactly answers your question, but here's a way to choose the first occurrence of each element. It keeps the uniqueness that you mentioned but it is not randomly chosen.
new_e1 <- c()
new_e2 <- c()
for(i in 1:nrow(df)) {
e1 <- df[i, "element1"]
e2 <- df[i, "element2"]
if (!(e1 %in% new_e1) && !(e2 %in% new_e2) &&
!(e1 %in% new_e2) && !(e2 %in% new_e1)) {
new_e1 <- c(new_e1, e1)
new_e2 <- c(new_e2, e2)
}
}
new_df <- data.frame(element1 = new_e1, element2 = new_e2, stringsAsFactors = FALSE)
I am sure that there are smarter ways to write this code, but I hope this helps.
Upvotes: 1