R remove rows with most zero values (unique and removing all rows with 0 not working)

Question

Hi I am stuck on a weird problem.

I have a dataset like the following:

City =c('XX','YY','XX','XX','YY')
Grade = c('A','B','A','A','B')
Variable1=c(.34,0,.34,0,0)
Variable2=c(.76,.3,0,.76,.3)
Variable3=c(.76,.3,0,.76,0)
Final_data = data.frame(City,Grade,Variable1,Variable2,Variable3)  
Final_data <- Final_data[order(City),]

After sorting this looks like this:

As you can see, rows 1,2 and 3 represents the same city (XX) which has a grade A assigned to it. The issue is that there is a duplication of the rows where some columns have 0. Optimally I would only have row 1 for city XX and 1 row for city YY. Specifically, what I would want in this case is that for each city and grade, count the number of zeroes in each row and then take the row which has the least number of zeroes.

One method could be this: https://stackoverflow.com/a/47914087/3988575. However, the above removes all the rows with zero. If you look at rows 4 and 5 with City YY and Grade B, both the rows have some columns with 0. Similarly,

Another way would be to use the unique function as shown here: https://stackoverflow.com/a/31875208/3988575. In this case, the first row meeting the condition will be selected. This also does not help me.

The expected output is something like this:

How would I be able to do this? Any help is appreciated.

Sonny · Accepted Answer

You could use dplyr for that as below:

library(dplyr)
Final_data$CountZero <- apply(Final_data[, -(1:2)], 1, function(x) {
  sum(x == 0)
})

Final_data %>%
  group_by(City, Grade) %>%
  filter(CountZero == min(CountZero)) %>%
  select(-CountZero)
# A tibble: 2 x 5
# Groups:   City, Grade [2]
  City  Grade Variable1 Variable2 Variable3
                  
1 XX    A          0.34      0.76      0.76
2 YY    B          0         0.3       0.3

R remove rows with most zero values (unique and removing all rows with 0 not working)

Answers (1)

Related Questions