Find unique set of identifiers/groups among several columns

Question

I have data with two (potentially more) columns of identifiers (typically long strings). These differ sometimes, are mistyped, or change over time. I want to identify unique subjects in the data. This requires identifying groups of cases which are connected via their ids at some level.

An example

df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
                 idb = c(1, 1, 3, 4, 4, 7),
                 trueid = c("id1", "id1", "id2", "id2", "id2", "id3"))

> df
  ida idb trueid
1   A   1    id1
2   B   1    id1
3   C   3    id2
4   C   4    id2
5   D   4    id2
6   E   7    id3

Identifiers for id1 are "A", "B", 1, for id2 "C", "D", 3, 4 and for id3 "E", 7.

I don't know trueid but need to find it using the information from columns ida and idb.

The solution needs to scale to millions of observations with tens of thousands of unique ids. I am already using data.table.

Extension: There is another scenario where there are more than two columns, and some columns might be informative for others, i.e. have the same identifiers. I do not know which columns are informative for which. I think type can be disregarded though, all columns are strings or can safely be converted.

Another example for this:

df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
                 idb = c("1", "2", "3", "4", "4", "7"),
                 idc = c("1", "1", "2", "3", "4", "5"),
                 idd = c("1", "A", "2", "3", "4", "5"),
                 trueid = c("id1", "id1", "id1", "id1", "id1", "id2"))

> df
  ida idb idc idd trueid
1   A   1   1   1    id1
2   B   2   1   A    id1
3   C   3   2   2    id1
4   C   4   3   3    id1
5   D   4   4   4    id1
6   E   7   5   5    id2

Edit: As a commenter pointed out, this is essentially the clique problem of finding complete subgraphs in a graph. After reading a bit more, I understand this issue can be solved with library(igraph). I am leaving the question open as I would prefer a solution that relies on base, data.table or dplyr. I cannot easily install packages on the server I am using, installing igraph involves dealing with a lot of red tape and delays.

Edit2: For anybody reading this and facing a similar problem: zx8754's answer using igraph is considerably (several orders of magnitude) faster on larger (simulated) data with more groups. If you have the chance to use igraph, do so.

chinsoon12 · Accepted Answer

Here is a recursive approach using data.table:

#convert into a long format for easier processing
mDT <- melt(DT[, rn := .I], id.var="rn", variable.name="V", value.name="ID")[,
    tid := NA_integer_]

#the recursive function
link <- function(ids, label) {
    #identify the rows in DT containing ids and extract the IDs
    newids <- mDT[mDT[.(ID=ids), on=.(ID), .(rn=rn)], on=.(rn), allow.cartesian=TRUE,
        unique(ID)]

    #update those rows to the same group
    mDT[mDT[.(ID=ids), on=.(ID), .(rn=rn)], on=.(rn), tid := label]

    if (length(setdiff(newids, ids)) > 0L) {
        #call the recursive function if there are new ids
        link(newids, label)
    }
}

#get the first id that is not labelled yet
id <- mDT[is.na(tid), ID[1L]]
grp <- 1L
while(!is.na(id)) {
    #use recursive function to link them up
    link(id, grp)

    #repeat for next id that is not part of any group yet
    id <- mDT[is.na(tid), ID[1L]]
    grp <- grp + 1L
}

#update original DT with tid
DT[mDT, on=.(rn), tid := tid]

data:

library(data.table)
DT <- data.table(ida = c("A", "B", "C", "C", "D", "E"),
    idb = c("1", "2", "3", "4", "4", "7"),
    idc = c("1", "1", "2", "3", "4", "5"),
    idd = c("1", "A", "2", "3", "4", "5"))

Find unique set of identifiers/groups among several columns

Answers (2)

Related Questions