Reputation: 55
I have a list of lists that contain the following 2 variables:
> dist_sub[[1]]$zip
[1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
[26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962 963 965 966 968 969 970 975 981
> dist_sub[[1]]$hu
[1] 4990 NA 168 13224 NA 3805 NA 6096 3884 4065 NA 16538 NA 12348 10850 NA
[17] 9322 17728 NA 13969 24971 5413 47317 7893 NA NA NA NA NA 140 NA 4
[33] NA NA NA NA NA 13394 8939 NA 3848 7894 2228 17775 NA NA NA
> dist_sub[[2]]$zip
[1] 921 934 952 956 957 958 959 960 961 962 965 966 968 969 970 971
> dist_sub[[2]]$hu
[1] 17728 140 4169 32550 18275 NA 22445 0 13394 8939 3848 7894 2228 17775 NA 12895
Is there a way remove duplicates such that if a zipcode appears in one list is removed from other lists according to specific criteria?
Example: zipcode 00921 is present in the two lists above. I'd like to keep it only on the list with the lowest sum of hu (housing units). In this I would like to keep zipcode 00921 in the 2nd list only since the sum of hu is 162,280 in list 2 versus 256,803 in list 1.
Any help is very much appreciated.
Upvotes: 1
Views: 1977
Reputation: 4432
Here is a simulate dataset for your problem so that others can use it too.
dist_sub <- list(list("zip"=1:10,
"hu"=rnorm(10)),
list("zip"=8:12,
"hu"=rnorm(5)),
list("zip"=c(1, 3, 11, 7),
"hu"=rnorm(4))
)
Here's a solution that I was able to come up with. I realized that loops were really the cleaner way to do this:
do.this <- function (x) {
for(k in 1:(length(x) - 1)) {
for (l in (k + 1):length(x)) {
to.remove <- which(x[[k]][["zip"]] %in% x[[l]][["zip"]])
x[[k]][["zip"]] <- x[[k]][["zip"]][-to.remove]
x[[k]][["hu"]] <- x[[k]][["hu"]][-to.remove]
}
}
return(x)
}
The idea is really simple: for each set of zips we keep removing the elements that are repeated in any set after it. We do this until the penultimate set because the last set will be left with no repeats in anything before it.
The solution to use the criterion you have, i.e. lowest sum of hu
can be easily implemented using the function above. What you need to do is reorder the list dist_sub
by sum of hu
like so:
sum_hu <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm=TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing=TRUE)]
Now you have dist_sub
sorted by sum_hu
which means that for each set the sets that come before it have larger sum_hu
. Therefore, if sets at values i
and j
(i < j) have values a
in common, then a
should be removed from i
th element. That is what this solution does too. Do you think that makes sense?
PS: I've called the function do.this
because I usually like writing generic solutions while this was a very specific question, albeit, an interesting one.
Upvotes: 2