Reputation: 581
I am trying to generate a unique ID column using the RecordLinkage package. I have successfully done so when working with smaller datasets (<= 1,000,000), but have not been able to reproduce this result for larger datasets (> 1,000,000) that use different (but similar) functions in the package. I am given multiple identifier variables for which I want to generate a unique ID despite the fact that there may be some errors (near matches) or duplicates in the records.
Given some data frame of identifiers:
library(RecordLinkage)
library(dplyr)
data(RLdata500)
df_identifiers <- RLdata500
This is the code for the smaller datasets (which works):
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- compare.dedup(df_identifiers)
p=epiWeights(rpairs)
classify <- epiClassify(p,0.3)
matches <- getPairs(object = classify, show = "links", single.rows = TRUE)
# this code writes an "ID" column that is the same for similar identifiers
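# keep one link per ID.2, preferring the smallest ID.1 (stored back in matches so the join below uses it)
matches <- matches %>% arrange(ID.1) %>% filter(!duplicated(ID.2))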
df_identifiers$ID_prior <- df_identifiers$ID
# merge matching information with the original data
df_identifiers <- left_join(df_identifiers, matches %>% select(ID.1,ID.2), by=c("ID"="ID.2"))
# replace matches in ID with the thing they match with from ID.1
df_identifiers$ID <- ifelse(is.na(df_identifiers$ID.1), df_identifiers$ID, df_identifiers$ID.1)
This approach is discussed here. However, the code does not carry over to larger datasets, which require other functions in the package. For example, the big-data equivalent of compare.dedup is RLBigDataDedup, whose RLBigData class supports similar functions such as epiWeights, epiClassify, getPairs, etc. Simply replacing compare.dedup with RLBigDataDedup does not work in this situation.
Consider the following attempt for large datasets:
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- RLBigDataDedup(df_identifiers)
p=epiWeights(rpairs)
( . . . )
Here, the remaining code is almost identical to the first version. Although epiWeights and epiClassify work on the RLBigData class as expected, getPairs does not: for this class it does not accept the show = "links" argument. Because of this, none of the subsequent code works.
Is there a different approach that needs to be taken to generate a column of unique IDs when working with larger datasets in the RLBigData class, or is this just a limitation?
Upvotes: 4
Views: 911
Reputation: 581
First, load the following libraries:
library(RecordLinkage)
library(dplyr)
library(magrittr)
Consider these example datasets from the RecordLinkage package:
data(RLdata500)
data(RLdata10000)
Assume we care about these matching variables and threshold:
matching_variables <- c("fname_c1", "lname_c1", "by", "bm", "bd")
threshold <- 0.5
The record linkage for SMALL datasets is as follows:
RLdata <- RLdata500
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
  compare.dedup() %>%
  epiWeights() %>%
  epiClassify(threshold) %>%
  getPairs(show = "links", single.rows = TRUE) -> matching_data
Here, the following SMALL data manipulation may be applied to append the appropriate IDs to the given dataset (same code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id1, id2) %>%
                         arrange(id1) %>% filter(!duplicated(id2)),
                       by = c("ID" = "id2")) %>%
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1)
RLdata$ID <- RLdata_ID$ID
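To see what this join does without running the full linkage, here is a minimal sketch on made-up data. The `toy_records` and `toy_links` tables are hypothetical stand-ins (not RecordLinkage output); only the `id1`/`id2` column names mirror the code above.

```r
library(dplyr)

# Hypothetical records: row 2 duplicates row 1, and row 5 duplicates row 3.
toy_records <- data.frame(name = c("anna", "ana", "bert", "carl", "berth"))

# Hypothetical link table: id1 is the earlier record, id2 the later record linked to it.
toy_links <- data.frame(id1 = c(1L, 3L), id2 = c(2L, 5L))

toy_result <- left_join(mutate(toy_records, ID = 1:nrow(toy_records)),
                        select(toy_links, id1, id2) %>%
                          arrange(id1) %>% filter(!duplicated(id2)),
                        by = c("ID" = "id2")) %>%
  # A record that appears as id2 takes over the ID of its earlier partner.
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1)

print(toy_result)
```

Records 1 and 2 end up sharing ID 1, records 3 and 5 share ID 3, and record 4 keeps its own ID.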
The equivalent code for LARGE datasets is as follows:
RLdata <- RLdata10000
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
  RLBigDataDedup() %>%
  epiWeights() %>%
  epiClassify(threshold) %>%
  getPairs(filter.link = "link", single.rows = TRUE) -> matching_data
Here, the following LARGE data manipulation may be applied to append the appropriate IDs to the given dataset (similar to code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id.1, id.2) %>%
                         arrange(id.1) %>% filter(!duplicated(id.2)),
                       by = c("ID" = "id.2")) %>%
  mutate(ID = ifelse(is.na(id.1), ID, id.1)) %>%
  select(-id.1)
RLdata$ID <- RLdata_ID$ID
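One caveat that may be worth checking on your own data: the join above resolves each record only one step. If the link table happens to contain a chain such as 1-2 and 2-3 without a direct 1-3 link, record 3 would receive ID 2 while record 2 receives ID 1. A possible workaround, which is a different technique from the join above and not part of the RecordLinkage package, is to group linked records with a small union-find pass. The sketch below uses a hypothetical `pairs` table and record count `n`; in practice they would come from `matching_data` and `nrow(df_names)`.

```r
# Hypothetical link table: 1-2 and 2-3 form a chain, 4-5 is a separate pair.
pairs <- data.frame(id.1 = c(1L, 2L, 4L), id.2 = c(2L, 3L, 5L))
n <- 6L  # hypothetical number of records

# Every record starts as the root of its own cluster.
parent <- seq_len(n)

# Follow parent pointers until reaching a record that is its own parent.
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Merge the clusters of each linked pair, keeping the smaller root as cluster ID.
for (k in seq_len(nrow(pairs))) {
  a <- find_root(pairs$id.1[k])
  b <- find_root(pairs$id.2[k])
  if (a != b) parent[max(a, b)] <- min(a, b)
}

# Final cluster ID for every record.
cluster_id <- vapply(seq_len(n), find_root, integer(1))
print(cluster_id)
```

Here records 1, 2 and 3 share one ID, 4 and 5 share another, and record 6 keeps its own. For very large link tables a vectorised or graph-based approach would likely be faster than this loop.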
Upvotes: 2