gfgm
gfgm

Reputation: 3647

create id variable from table of duplicates

I have a dataframe where each row has a unique identifier, but some rows are actually duplicates.

fdf <- data.frame(name = c("fred", "ferd", "frad", 'eric', "eirc", "george"),
                  id = 1:6)
fdf
#>     name id
#> 1   fred  1
#> 2   ferd  2
#> 3   frad  3
#> 4   eric  4
#> 5   eirc  5
#> 6 george  6

I have determined which rows are duplicated and this information is stored in a second dataframe as pairs of the unique id's. So the key tells me row 1 is the same individual as rows 2 and 3, etc.

key <- data.frame(id1 = c(1,1,2,4), id2 = c(2,3,3,5))
key
#>   id1 id2
#> 1   1   2
#> 2   1   3
#> 3   2   3
#> 4   4   5

I'm struggling to think up a straightforward way to use the key to create an id variable in my original dataframe. Desired output would be:

fdf$realid <- c(1,1,1,2,2,3)
fdf
#>     name id realid
#> 1   fred  1      1
#> 2   ferd  2      1
#> 3   frad  3      1
#> 4   eric  4      2
#> 5   eirc  5      2
#> 6 george  6      3

Edit for clarity

Keys here are the set of true connections between rows in the data.frame fdf. Thus you can imagine starting with the set of all feasible connections:

#  id1 id2
#   1   2
#   1   3
#   1   4
#   ...
#   6   4
#   6   5

determining which are true connections (based on the other variables in each observation).

#  id1 id2 match
#   1   2   match
#   1   3  no match
#   1   4   match
#   ...
#   6   4   no match
#   6   5   no match

and sub-setting to the cases that are matches.

Upvotes: 3

Views: 188

Answers (3)

Jordi
Jordi

Reputation: 1343

If I understand your question correctly it can be solved by creating groups of matching ids and creating a new (real) id out of these groups:

# determine the groups of ids
id_groups <- list()
i = 1
for (id in unique(key$id1)) {
  if (!(id %in% unlist(id_groups))) {
    id_groups[[i]] <- c(id, key$id2[key$id1 == id])
    i = i + 1
  }
}

# add ids without match
id_groups <- c(id_groups, setdiff(fdf$id, unlist(id_groups)))

# for every id in fdf, set real_id to index in id_groups to which id belongs
fdf$real_id <- sapply(fdf$id, function(id) {
  which(sapply(id_groups, function(group) id %in% group))
})

Upvotes: 1

KenHBS
KenHBS

Reputation: 7164

I didn't find a 'straight forward way', but it seems to work well.

First you check which IDs are together in a group, by checking whether there's 'overlap', i.e. whether the intersection between two rows in key is non-empty:

check_overlap <- function(pair1, pair2){
  newset <- intersect(pair1, pair2)
  length(newset) != 0
}

Then we can apply this function to the rows in key against the other rows. If a row has been matched already, it is automatically removed from key, like this:

check_overlaps <- function(key){
  cont <- data.frame()
  i <- 1
  while(nrow(key) > 0){
    ids  <- apply(key, 1, check_overlap, key[1, ])
    vals <- unique(unlist(key[ids, ]))
    key  <- key[!ids, ]
    cont <- rbind(cont, cbind(vals, rep(i, length(vals))))
    i <- i+1
  }
  return(cont)
}

new_ids <- check_overlaps(key)
#    vals V2
# 1    1  1
# 2    2  1
# 3    3  1
# 4    4  2
# 5    5  2

The problem with merging fdf and new_ids, however, is that some old IDs may not occur in key, but they should be mapped to a new ID according to the new order. You can manipulate key a bit a priori and do:

for(val in unique(fdf$id)){
  if(!(val %in% unlist(key))){
    key <- rbind(key, c(val, val))
  }
}

new_ids2 <- check_overlaps(key)
     vals V2
# 1    1  1
# 2    2  1
# 3    3  1
# 4    4  2
# 5    5  2
# 6    6  3

Which is easy to merge with fdf like:

merge(fdf, new_ids2, by.x = "id", by.y = "vals")
    id   name V2
# 1  1   fred  1
# 2  2   ferd  1
# 3  3   frad  1
# 4  4   eric  2
# 5  5   eirc  2
# 6  6 george  3

Upvotes: 1

jbfern
jbfern

Reputation: 41

The easiest way would be to recreate the key data frame to the following format (i.e. which id belongs to which realid)

key <- data.frame(id     = c(1, 2, 3, 4, 5, 6), 
                  realid = c(1, 1, 1, 2, 2, 3))

Then it is just a matter of merging fdf and key together with merge

fdf <- merge(fdf, key_table, by.x = "id")
fdf
  id   name realid
1  1   fred      1
2  2   ferd      1
3  3   frad      1
4  4   eric      2
5  5   eirc      2
6  6 george      3

Upvotes: 2

Related Questions