Reputation: 89
I got stuck on a string similarity issue.
This is what my data looks like (the original data is huge):
SerialNumber SubSerialID Date
AGCC0775CFNDA1040TMT775 AVCC0775CFNDA1040 2018/01/08
AGCC0775CFNDA1040 AVCC0775CFNDA1040 2015/12/28
AGCC0775CFNDA10407EC AVCC0775CFNDA1040 2018/03/17
CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019/01/07
2658358 BGDC0865HFNKG1043 2018/08/09
MT765E~C0765KFNCD1044 C0765KFNCD10 2015/04/07
187A126 C0765KFNCD10 2017/11/30
...
My target is:
SerialNumber SubSerialID Date
AGCC0775CFNDA10407EC AVCC0775CFNDA1040 2018/03/17
CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019/01/07
2658358 BGDC0865HFNKG1043 2018/08/09
MT765E~C0765KFNCD1044 C0765KFNCD10 2015/04/07
187A126 C0765KFNCD10 2017/11/30
...
The Serial Numbers AGCC0775CFNDA1040TMT775, AGCC0775CFNDA1040, and AGCC0775CFNDA10407EC are all the same thing; the variants were caused by data-entry mistakes. I want to keep AGCC0775CFNDA10407EC because it has the latest date on record. However, I cannot filter those Serial Numbers using SubSerialID and Date directly, because that would also remove 2658358.
I thought about using stringdist to compute string similarity as an additional filter condition (e.g., filter out rows with abs(similarity) > 1.5 or abs(similarity) < 0.5), but I cannot figure out an efficient way to handle it. The data is huge, so using a for loop is impractical. I have been stuck for a while; hopefully someone can give me some advice or a suggestion on this matter.
Upvotes: 3
Views: 169
Reputation: 50668
The following reproduces your expected output:
library(dplyr)
library(purrr)

df %>%
    mutate(Date = as.Date(Date)) %>%
    mutate_if(is.factor, as.character) %>%
    # Levenshtein distance between each SerialNumber and its SubSerialID
    mutate(dist = map2_dbl(SerialNumber, SubSerialID, adist)) %>%
    group_by(SubSerialID) %>%
    # Keep all rows of a group if every distance exceeds 5;
    # otherwise keep only the row with the latest Date
    filter(all(dist > 5) | Date == max(Date)) %>%
    ungroup()
## A tibble: 5 x 4
# SerialNumber SubSerialID Date dist
# <chr> <chr> <date> <dbl>
#1 AGCC0775CFNDA10407EC AVCC0775CFNDA1040 2018-03-17 4
#2 CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019-01-07 15
#3 2658358 BGDC0865HFNKG1043 2018-08-09 15
#4 MT765E~C0765KFNCD1044 C0765KFNCD10 2015-04-07 9
#5 187A126 C0765KFNCD10 2017-11-30 11
The idea is to keep all entries (per SubSerialID) if all Levenshtein distances between SubSerialID and SerialNumber are greater than 5. If at least one distance is <= 5, only the row with the largest Date is kept. I've kept the dist column for debugging; you can remove it with select(-dist).
I'm not sure how generalisable this is. You will have to play around with the Levenshtein distance threshold (which I set to 5 in this case).
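One way to pick that threshold (a sketch, not part of the solution above) is to look at a few distances directly: base R's adist() computes the generalized Levenshtein distance, and on the sample data the "same serial with typos" pairs sit well below the "genuinely different serial" pairs, so any cutoff between the two groups works.

```r
# Sketch: compare distances for a typo pair vs. an unrelated pair,
# using strings taken from the sample data above.
near <- adist("AGCC0775CFNDA10407EC", "AVCC0775CFNDA1040")  # same serial, typos
far  <- adist("2658358", "BGDC0865HFNKG1043")               # unrelated serial

near  # 4
far   # 15
```

Since 4 and 15 are well separated here, a threshold of 5 cleanly splits the two cases; on your full data you would want to check that the gap still exists.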
df <- read.table(text =
"SerialNumber SubSerialID Date
AGCC0775CFNDA1040TMT775 AVCC0775CFNDA1040 2018/01/08
AGCC0775CFNDA1040 AVCC0775CFNDA1040 2015/12/28
AGCC0775CFNDA10407EC AVCC0775CFNDA1040 2018/03/17
CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019/01/07
2658358 BGDC0865HFNKG1043 2018/08/09
MT765E~C0765KFNCD1044 C0765KFNCD10 2015/04/07
187A126 C0765KFNCD10 2017/11/30", header = TRUE)
Upvotes: 3