Reputation: 305
I'm stumped. How can I copy data to another row within an ID variable based on if the rows match. I'm working with thousands of historical addresses and not all the addresses match perfectly. But any differences are typically at the end of the address so using the first 4 or 5 characters of that value should handle it. I would like to fill in the NAs with the appropriate tract code. I've been trying dplyr solutions and not getting anywhere. Any thoughts would be much appreciated.
ID<-c(50,50,50,71,71,71)
ID_Y<-c(505,506,507,715,716,717)
address<-c("325 Park St N","325 Park St","325 Park","616 Holly","616 Holly Dr","510 Walnut Dr")
tract<-c(110,NA,NA,223,NA,989)
AD567<-data.frame(ID,ID_Y,address,tract)
AD567
ID ID_Y address tract
1 50 505 325 Park St N 110
2 50 506 325 Park St NA
3 50 507 325 Park NA
4 71 715 616 Holly 223
5 71 716 616 Holly Dr NA
6 71 717 510 Walnut Dr 989
Trying to get here:
ID ID_Y address tract
1 50 505 325 Park St N 110
2 50 506 325 Park St 110
3 50 507 325 Park 110
4 71 715 616 Holly 223
5 71 716 616 Holly Dr 223
6 71 717 510 Walnut Dr 989
Upvotes: 1
Views: 382
Reputation: 1784
This is a solution without any additional library
# introduce an additional column which serves as heuristic key
AD567$prefix = substr(AD567$address, 1, 8)
# extract all records which have a tract code
TRACT = AD567[! is.na(AD567$tract),c("prefix", "tract")]
# check if the record is unique per prefix
aggregate(tract ~ prefix, TRACT, length)
# ... one may use only those records further on which are unique ...
# merge both data frames to inject the tract code; make sure nothing
# is lost from AD567
AD567 = merge(AD567, TRACT, by="prefix", suffixes = c("", ".ref"), all.x = TRUE)
# copy over tract code
AD567$tract = AD567$tract.ref
# remove utility columns
AD567 = AD567[, ! colnames(AD567) %in% c("prefix", "tract.ref")]
Please keep in mind that this is a very poor heuristic. Inexact or fuzzy data matching is a science on its own.
Upvotes: 1