Reputation: 11
I have a dataset with two columns, towns and zip codes. I would like it if every ZIP code had one town rather than having several different towns spread out. That is, I was thinking something like:
for every unique zip-code set all town.names equal to the most frequent town name for that zipcode
So far I have tried something like:
unique.zipcode <- mydata$zipcode[!duplicated(mydata$zipcode)
for(i in 1:length(unique.zipcode){
mydata$town <- names(which.max(table(mydata$town[mydata$zipcode==unique.zipcode[i]])))
}
Upvotes: 1
Views: 49
Reputation: 78822
Simulate some data since you didn't provide any:
library(dplyr)
set.seed(1492)
data_frame(
town=sample(LETTERS[1:6], 100, replace=TRUE),
zip=sample(c("10010", "21000", "30100", "40001"), 100, replace=TRUE)
) -> df
Take a look:
glimpse(df)
## Observations: 100
## Variables: 2
## $ town <chr> "B", "B", "B", "A", "A", "A", "F", "D", "C", "E", "D", "D...
## $ zip <chr> "21000", "30100", "21000", "21000", "10010", "21000", "40...
Prove we have dups:
count(df, zip)
## # A tibble: 4 × 2
## zip n
## <chr> <int>
## 1 10010 27
## 2 21000 24
## 3 30100 19
## 4 40001 30
Eliminate them by frequency & deal with ties.
count(df, zip, town) %>%
top_n(1) %>%
slice(1)
## Source: local data frame [4 x 3]
## Groups: zip [4]
##
## zip town n
## <chr> <chr> <int>
## 1 10010 C 7
## 2 21000 D 6
## 3 30100 F 6
## 4 40001 D 8
Upvotes: 1