Joseph

Reputation: 581

Generating a unique ID column for large dataset with the RecordLinkage package

I am trying to generate a unique ID column using the RecordLinkage package. This works for smaller datasets (<= 1,000,000 records), but I have not been able to reproduce the result for larger datasets (> 1,000,000 records), which require different (but similar) functions in the package. I have multiple identifier variables and want to assign a single unique ID to each entity, even though the records may contain errors (near matches) or duplicates.

Given some data frame of identifiers:

library(RecordLinkage)
library(dplyr)
data(RLdata500)
df_identifiers <- RLdata500

This is the code for the smaller datasets (which works):

df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- compare.dedup(df_identifiers)
p <- epiWeights(rpairs)
classify <- epiClassify(p,0.3)
matches <- getPairs(object = classify, show = "links", single.rows = TRUE)

# this code writes an "ID" column that is the same for similar identifiers
matches <- matches %>% arrange(ID.1) %>% filter(!duplicated(ID.2))
df_identifiers$ID_prior <- df_identifiers$ID

# merge matching information with the original data
df_identifiers <- left_join(df_identifiers, matches %>% select(ID.1,ID.2), by=c("ID"="ID.2"))

# replace matches in ID with the thing they match with from ID.1
df_identifiers$ID <- ifelse(is.na(df_identifiers$ID.1), df_identifiers$ID, df_identifiers$ID.1)

This approach is discussed here. However, the code does not carry over to larger datasets, which require other functions. For example, the big-data equivalent of compare.dedup is RLBigDataDedup, whose RLBigData class supports similar functions such as epiWeights, epiClassify, and getPairs. Simply replacing compare.dedup with RLBigDataDedup does not work in this situation.

Consider the following attempt for large datasets:

df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- RLBigDataDedup(df_identifiers)
p <- epiWeights(rpairs)
( . . . )

Here, the remaining code is almost identical to that of the first version. Although epiWeights and epiClassify work on the RLBigData class as expected, getPairs does not: in this case it does not accept the show = "links" argument, so all subsequent code fails.

Is there a different approach that needs to be taken to generate a column of unique IDs when working with larger datasets in the RLBigData class, or is this just a limitation?

Upvotes: 4

Views: 911

Answers (1)

Joseph

Reputation: 581

First, load the following libraries:

library(RecordLinkage)
library(dplyr)
library(magrittr)

Consider these example datasets from the RecordLinkage package:

data(RLdata500)
data(RLdata10000)

Assume we care about these matching variables and threshold:

matching_variables <- c("fname_c1", "lname_c1", "by", "bm", "bd")
threshold <- 0.5

The record linkage for SMALL datasets is as follows:

RLdata <- RLdata500
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
  compare.dedup() %>%
  epiWeights() %>%
  epiClassify(threshold) %>%
  getPairs(show = "links", single.rows = TRUE) -> matching_data

For SMALL datasets, the following data manipulation appends the matched IDs to the original dataset (same code as here):

RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id1, id2) %>%
                         arrange(id1) %>% filter(!duplicated(id2)),
                       by = c("ID" = "id2")) %>%
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1)
RLdata$ID <- RLdata_ID$ID

The equivalent code for LARGE datasets is as follows (note that getPairs takes filter.link = "link" in place of show = "links"):

RLdata <- RLdata10000
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
  RLBigDataDedup() %>%
  epiWeights() %>%
  epiClassify(threshold) %>%
  getPairs(filter.link = "link", single.rows = TRUE) -> matching_data

For LARGE datasets, the following data manipulation appends the matched IDs to the original dataset (similar to the code from here, except that the ID columns are named id.1 and id.2 rather than id1 and id2):

RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id.1, id.2) %>%
                         arrange(id.1) %>% filter(!duplicated(id.2)),
                       by = c("ID" = "id.2")) %>%
  mutate(ID = ifelse(is.na(id.1), ID, id.1)) %>%
  select(-id.1)
RLdata$ID <- RLdata_ID$ID
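
To see what the join step does independently of RecordLinkage, here is a minimal sketch on a hand-made table of linked pairs. The helper assign_linked_ids and the column names id_a and id_b are hypothetical stand-ins for the id1/id2 (or id.1/id.2) columns returned by getPairs; only dplyr is assumed.

```r
library(dplyr)

# Hypothetical helper (not part of RecordLinkage): give every record the ID of
# the record it was linked to, or its own row number if it has no link.
assign_linked_ids <- function(n_records, pairs) {
  # pairs: data frame with integer columns id_a and id_b, one row per link,
  # where id_a is the record that id_b should inherit its ID from
  lookup <- pairs %>%
    arrange(id_a) %>%
    filter(!duplicated(id_b))          # keep a single link per id_b record

  data.frame(ID = seq_len(n_records)) %>%
    left_join(lookup, by = c("ID" = "id_b")) %>%
    mutate(ID = ifelse(is.na(id_a), ID, id_a)) %>%
    pull(ID)
}

# Toy input: records 2 and 5 link to record 1, record 4 links to record 3.
toy_pairs <- data.frame(id_a = c(1L, 1L, 3L), id_b = c(2L, 5L, 4L))
ids <- assign_linked_ids(6, toy_pairs)
print(ids)
```

Each record that appears as the second member of a link takes the ID of its partner, and unlinked records keep their own row number. One caveat of this join-based approach: chains are not resolved. If record 2 links to record 1 and record 3 links only to record 2, record 3 ends up with ID 2 while record 2 gets ID 1, so it is worth checking for such cases in the output.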

Upvotes: 2
