Reputation: 21
I have a data set of roughly 5 million rows of businesses with contact information (ID(int), Email(text), BusinessPhone(text), WorkPhone(text), CellPhone(text)) - over 3 million of these rows contain duplicate data. But the duplicates aren't exact duplicates - for example, the same phone number may appear in multiple rows alongside different email addresses.
I want each row to carry unique information so that no phone number or email is duplicated across the file. I plan to do this by aggregating the matching IDs into a new column. This column (we'll call it ROWIDs) should be a comma-separated concatenation of all the IDs whose rows share one or more of the contact data points (Email, BusinessPhone, WorkPhone, CellPhone) that appear in that row.
I have written code that works for small sample sizes but I cannot figure out how to scale it.
Note: rows with a missing phone number (in any of the three phone columns) have an 'NA' text placeholder. Row example:
before:
ID Email BusinessPhone WorkPhone CellPhone
1 [email protected] 5555555555 NA 9998887777
2 NA 5555555555 873998898 NA
Desired After:
ID Email BusinessPhone WorkPhone CellPhone ROWIDs
1 [email protected] 5555555555 NA 9998887777 1,2
2 NA 5555555555 873998898 NA 1,2
library(foreach)
library(doParallel)
registerDoParallel(cores = detectCores())
getDoParWorkers() # (4)
#################### GET MATCHING ROW IDs ######################################################
# d: a data.table with columns ID(int), Email(char), BusinessPhone(char),
# WorkPhone(char), CellPhone(char), ROWIDs(char, initialized to '0')
getRowIDs = function(d) {
  foreach(i = 1:nrow(d), .packages = c('dplyr', 'data.table')) %dopar% {
    # 1. IDENTIFY THE ROW IDS THAT SHARE CONTACT DATA WITH ROW i:
    rowIDList = d[(Email == Email[i] & Email != 'NA') |
                  (BusinessPhone %in% c(BusinessPhone[i], WorkPhone[i], CellPhone[i]) & BusinessPhone != 'NA') |
                  (WorkPhone %in% c(BusinessPhone[i], WorkPhone[i], CellPhone[i]) & WorkPhone != 'NA') |
                  (CellPhone %in% c(BusinessPhone[i], WorkPhone[i], CellPhone[i]) & CellPhone != 'NA'),
                  ID] %>% as.integer()
    # 2. COLLAPSE THE IDS INTO A SINGLE COMMA-SEPARATED STRING:
    rowIDs = paste(rowIDList, collapse = ',')
    # 3. WRITE THAT STRING INTO ROWIDs FOR EVERY MATCHING ROW:
    d[ID %in% rowIDList, ROWIDs := rowIDs]
  }
}
For a random sample of 1,000 rows, system.time reports: user: 0.75, system: 0.12, elapsed: 1.35.
This grows roughly quadratically rather than linearly, since every row is compared against every other row. At 5,000 rows: user: 12.55, system: 1.50, elapsed: 16.72.
And at 10,000: user: 50.97, system: 16.77, elapsed: 71.88.
This came a long way from where I started but this is as far as I can take it with my current skills. Any help or guidance is appreciated.
Upvotes: 2
Views: 138
Reputation: 25225
Not sure if this is fast enough for your dataset, but you can use igraph
to identify the clusters of IDs that refer to the same person:
library(igraph)

# convert all columns to character, melt to long format (ID, variable, value)
# dropping NAs, then for each contact value shared by more than one ID,
# emit every pair of those IDs as an edge
edges <- melt(DT[, (names(DT)) := lapply(.SD, as.character)], id.vars="ID", na.rm=TRUE)[,
    if (.N > 1L) transpose(combn(ID, 2L, simplify=FALSE)), value][, (1) := NULL]

# build an undirected graph; each connected component is one cluster of IDs
# that share contact data (directly or transitively)
g <- graph_from_data_frame(edges, FALSE)
mem <- setDT(stack(clusters(g)$membership))[, ROWIDs := toString(ind), values]

# join the cluster labels back onto DT by ID
DT[mem, on=.(ID=ind), ROWIDs := ROWIDs]
DT
output:
ID Email BusinessPhone WorkPhone CellPhone ROWIDs
1: 1 [email protected] 5 <NA> 7 1, 2
2: 2 <NA> 5 6 <NA> 1, 2
3: 3 [email protected] <NA> <NA> <NA> 3, 4
4: 4 [email protected] <NA> 1 <NA> 3, 4
5: 5 [email protected] <NA> 3 <NA> 5, 6
6: 6 <NA> <NA> 3 <NA> 5, 6
7: 7 [email protected] <NA> <NA> 4 7, 8
8: 8 <NA> <NA> <NA> 4 7, 8
data:
library(data.table)
DT <- fread("
ID Email BusinessPhone WorkPhone CellPhone
1 [email protected] 5 NA 7
2 NA 5 6 NA
3 [email protected] NA NA NA
4 [email protected] NA 1 NA
5 [email protected] NA 3 NA
6 NA NA 3 NA
7 [email protected] NA NA 4
8 NA NA NA 4
")
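The connected-components idea above is not R-specific. Below is a hedged, language-agnostic sketch in Python of the same clustering logic using a union-find (disjoint-set) structure instead of an explicit graph: every row is linked to the first row seen with each of its contact values, so transitively-connected rows end up in one cluster. The function and helper names (`cluster_ids`, `find`, `union`) are illustrative, not from any library, and the dict-of-rows input is a stand-in for the question's table.

```python
from collections import defaultdict

def cluster_ids(rows):
    """Group row IDs that share any contact value (email or phone).

    rows: list of dicts with keys ID, Email, BusinessPhone, WorkPhone,
    CellPhone; missing values are None.
    Returns {ID: "comma, separated, sorted cluster IDs"}.
    """
    # union-find over row IDs
    parent = {r["ID"]: r["ID"] for r in rows}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # link each row to the first row seen carrying the same contact value;
    # this avoids materializing all O(k^2) pairs within a group
    first_seen = {}
    for r in rows:
        for key in ("Email", "BusinessPhone", "WorkPhone", "CellPhone"):
            v = r[key]
            if v is None:
                continue
            if v in first_seen:
                union(first_seen[v], r["ID"])
            else:
                first_seen[v] = r["ID"]

    # collect members per cluster root and format like the ROWIDs column
    members = defaultdict(list)
    for rid in parent:
        members[find(rid)].append(rid)
    return {rid: ", ".join(map(str, sorted(members[find(rid)])))
            for rid in parent}
```

With the question's two example rows (sharing BusinessPhone 5555555555), both IDs map to "1, 2". One pass over the data plus near-constant-time union operations keeps this close to linear in the number of rows, which is the main reason it scales where all-pairs row comparison does not.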
Upvotes: 1