Reputation: 693
Hi I am stuck on a weird problem.
I have a dataset like the following:
City =c('XX','YY','XX','XX','YY')
Grade = c('A','B','A','A','B')
Variable1=c(.34,0,.34,0,0)
Variable2=c(.76,.3,0,.76,.3)
Variable3=c(.76,.3,0,.76,0)
Final_data = data.frame(City,Grade,Variable1,Variable2,Variable3)
Final_data <- Final_data[order(City),]
After sorting this looks like this:
As you can see, rows 1,2 and 3 represents the same city (XX) which has a grade A assigned to it. The issue is that there is a duplication of the rows where some columns have 0. Optimally I would only have row 1 for city XX and 1 row for city YY. Specifically, what I would want in this case is that for each city and grade, count the number of zeroes in each row and then take the row which has the least number of zeroes.
One method could be this: https://stackoverflow.com/a/47914087/3988575. However, the above removes all the rows with zero. If you look at rows 4 and 5 with City YY and Grade B, both the rows have some columns with 0. Similarly,
Another way would be to use the unique function as shown here: https://stackoverflow.com/a/31875208/3988575. In this case, the first row meeting the condition will be selected. This also does not help me.
The expected output is something like this:
How would I be able to do this? Any help is appreciated.
Upvotes: 0
Views: 187
Reputation: 3183
You could use dplyr
for that as below:
library(dplyr)
Final_data$CountZero <- apply(Final_data[, -(1:2)], 1, function(x) {
sum(x == 0)
})
Final_data %>%
group_by(City, Grade) %>%
filter(CountZero == min(CountZero)) %>%
select(-CountZero)
# A tibble: 2 x 5
# Groups: City, Grade [2]
City Grade Variable1 Variable2 Variable3
<fct> <fct> <dbl> <dbl> <dbl>
1 XX A 0.34 0.76 0.76
2 YY B 0 0.3 0.3
Upvotes: 2