swhusky
swhusky

Reputation: 305

copying rows based on column values

I'm stumped. How can I copy data to another row within an ID variable based on if the rows match. I'm working with thousands of historical addresses and not all the addresses match perfectly. But any differences are typically at the end of the address so using the first 4 or 5 characters of that value should handle it. I would like to fill in the NAs with the appropriate tract code. I've been trying dplyr solutions and not getting anywhere. Any thoughts would be much appreciated.

ID<-c(50,50,50,71,71,71)
ID_Y<-c(505,506,507,715,716,717)
address<-c("325 Park St N","325 Park St","325 Park","616 Holly","616 Holly Dr","510 Walnut Dr")
tract<-c(110,NA,NA,223,NA,989)

AD567<-data.frame(ID,ID_Y,address,tract)
AD567
  ID ID_Y       address tract
1 50  505 325 Park St N   110
2 50  506   325 Park St   NA
3 50  507      325 Park   NA
4 71  715     616 Holly   223
5 71  716  616 Holly Dr   NA
6 71  717 510 Walnut Dr   989

Trying to get here:

  ID ID_Y       address tract
1 50  505 325 Park St N   110
2 50  506   325 Park St   110
3 50  507      325 Park   110
4 71  715     616 Holly   223
5 71  716  616 Holly Dr   223
6 71  717 510 Walnut Dr   989

Upvotes: 1

Views: 382

Answers (1)

Claude
Claude

Reputation: 1784

This is a solution without any additional library

# introduce an additional column which serves as heuristic key
AD567$prefix = substr(AD567$address, 1, 8)

# extract all records which have a tract code
TRACT = AD567[! is.na(AD567$tract),c("prefix", "tract")]
# check if the record is unique per prefix
aggregate(tract ~ prefix, TRACT, length)
# ... one may use only those records further on which are unique ...

# merge both data frames to inject the tract code; make sure nothing
# is lost from AD567
AD567 = merge(AD567, TRACT, by="prefix", suffixes = c("", ".ref"), all.x = TRUE)

# copy over tract code
AD567$tract = AD567$tract.ref

# remove utility columns
AD567 = AD567[, ! colnames(AD567) %in% c("prefix", "tract.ref")]

Please keep in mind that this is a very poor heuristic. Inexact or fuzzy data matching is a science on its own.

Upvotes: 1

Related Questions