Helen
Helen

Reputation: 75

Filter pairs of elements so that each element occurs in only one pair

I have a data frame containing pairs of interacting elements. I want to filter the pairs so that each element occurs in only one pair. Which pair is kept for each element should be randomly selected.

df <- data.frame(element1=c("A","A","B","B","B","C","C","D"), 
                 element2=c("B","C","D","E","C","D","E","F"),
                 stringsAsFactors = FALSE)

Example input:

Element1   Element2   
A          B        
A          C   
B          D       
B          E
B          C
C          D
C          E 
D          F       

Desired output e.g.:

Element1   Element2   
A          B
C          E
D          F

Thank you for your help.

Upvotes: 1

Views: 145

Answers (3)

chinsoon12
chinsoon12

Reputation: 25225

For this toy example, you can approach it as an optimization problem with binary variables representing which rows are chosen, and constraints such that each variable is chosen once.

el <- sort(unique(c(df$element1, df$element2)))
A <- matrix(0L, nrow=length(el), ncol=nrow(df), dimnames=list(el, sample(nrow(df))))
i <- match(unlist(df), rownames(A))
j <- match(rep(1L:nrow(df), ncol(df)), colnames(A))
A[cbind(i, j)] <- 1L

ans <- lp(direction="min",
    objective.in=rep(1L, nrow(df)),
    const.mat=A,
    const.dir="==",
    const.rhs=rep(1L, length(el)),
    all.bin=TRUE)
df[colnames(A)[as.logical(ans$solution)], ]

output:

  element1 element2
2        A        C
4        B        E
8        D        F

Upvotes: 0

Axeman
Axeman

Reputation: 35382

A simple igraph method, that isn't smart about keeping the most data. But it will guarantee a maximum of one connection per element.

library(igraph)
gr <- graph_from_data_frame(df, directed = FALSE)

for (vertex in sample(V(gr))) {
  if (degree(gr)[vertex] > 1) {
    edges <- E(gr)[.inc(vertex)]
    to_remove <- sample(edges, length(edges) - 1)
    gr <- delete_edges(gr, to_remove)
  }
}
as_data_frame(gr)

You can maybe do a bit better when ordering the vertices beforehand.

Upvotes: 3

RD_
RD_

Reputation: 324

I am not sure if this exactly answers your question, but here's a way to choose the first occurrence of each element. It keeps the uniqueness that you mentioned but it is not randomly chosen.

    new_e1 <- c()
    new_e2 <- c()

    for(i in 1:nrow(df)) {
        e1 <- df[i, "element1"]
        e2 <- df[i, "element2"]
        if (!(e1 %in% new_e1) && !(e2 %in% new_e2) &&
            !(e1 %in% new_e2) && !(e2 %in% new_e1)) {
                new_e1 <- c(new_e1, e1)
                new_e2 <- c(new_e2, e2)
        }
    }

    new_df <- data.frame(element1 = new_e1, element2 = new_e2, stringsAsFactors = FALSE)

I am sure that there are smarter ways to write this code, but I hope this helps.

Upvotes: 1

Related Questions