Remove infrequent rows for the count of values in EACH column

Question

I have data with several infrequent levels in each column. Find the dput() of a small portion of data at the bottom of the question. Here's how the data is structured;

   AppointmentMonth DayofWeek AppointmentHour   EncounterType
1               Sep       Mon              16    Office Visit
2               Jun       Tue              13    Office Visit
3               Sep       Mon              14 Procedure Visit
4               Dec       Thu              14    Office Visit
5               Mar       Tue              11    Office Visit
6               May       Fri              14    Office Visit
7               May       Tue              11    Office Visit
8               May       Tue               9    Office Visit
.......

When you check the frequency table of each column you will realize some levels have 0, 1, or 2 occurrences. Here is an example.

table(data$AppointmentHour)
8  9 10 11 12 13 14 15 16 17 
4  4  2  4  1  2  8  1  3  1

I would like to identify and remove the infrequent levels whose frequencies, let's say less than 3 (can be changed depending on the problem/data), in each column.

I tried @akrun's answer in this question. Here is the code chunk:

library(data.table)

setDT(data)[data[, .I[.N >= 3], 
      by = .(AppointmentMonth, DayofWeek, AppointmentHour, EncounterType)]$V1]

However, this code removes the infrequent levels based on the combination of columns, not based on each individual column.

Data:

data <- structure(list(AppointmentMonth = structure(c(9L, 6L, 9L, 12L, 
3L, 5L, 5L, 5L, 7L, 10L, 9L, 12L, 7L, 3L, 11L, 9L, 11L, 12L, 
12L, 7L, 1L, 6L, 7L, 12L, 1L, 3L, 11L, 4L, 9L, 4L), levels = c("Jan", 
"Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", 
"Nov", "Dec"), class = c("ordered", "factor")), DayofWeek = structure(c(2L, 
3L, 2L, 5L, 3L, 6L, 3L, 3L, 3L, 5L, 4L, 2L, 2L, 4L, 2L, 5L, 3L, 
2L, 4L, 3L, 4L, 6L, 6L, 5L, 2L, 2L, 3L, 2L, 3L, 5L), levels = c("Sun", 
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), class = c("ordered", 
"factor")), AppointmentHour = c(16L, 13L, 14L, 14L, 11L, 14L, 
11L, 9L, 9L, 11L, 12L, 10L, 16L, 15L, 8L, 8L, 11L, 8L, 14L, 8L, 
16L, 9L, 14L, 14L, 13L, 9L, 10L, 14L, 17L, 14L), EncounterType = structure(c(`Office Visit` = 1L, 
`Office Visit` = 1L, `Procedure Visit` = 2L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, Appointment = 3L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Procedure Visit` = 2L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Procedure Visit` = 2L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L, `Office Visit` = 1L, `Office Visit` = 1L, 
`Office Visit` = 1L), levels = c("Office Visit", "Procedure Visit", 
"Appointment", "Treatment", "Telemedicine"), class = "factor")), row.names = c(NA, 
30L), class = "data.frame")

M-- · Accepted Answer

Update:

We can use inner_join after filtering out rows based on each column:

library(dplyr)
library(rlang)
library(purrr)

map(names(data), ~ data %>% 
                    add_count(!!sym(.x)) %>% 
                    filter(n >=3) %>% 
                    select(-n)) %>% 
  reduce(inner_join, by = names(data))

#> # A tibble: 18 x 4
#>    AppointmentMonth DayofWeek AppointmentHour EncounterType  
#>                                          
#>  1 Sep              Mon                    16 Office Visit   
#>  2 Sep              Mon                    14 Procedure Visit
#>  3 Dec              Thu                    14 Office Visit   
#>  4 Mar              Tue                    11 Office Visit   
#>  5 May              Fri                    14 Office Visit   
#>  6 May              Tue                    11 Office Visit   
#>  7 May              Tue                     9 Office Visit   
#>  8 Jul              Tue                     9 Office Visit   
#>  9 Jul              Mon                    16 Office Visit   
#> 10 Nov              Mon                     8 Office Visit   
#> 11 Sep              Thu                     8 Office Visit   
#> 12 Nov              Tue                    11 Office Visit   
#> 13 Dec              Mon                     8 Procedure Visit
#> 14 Dec              Wed                    14 Office Visit   
#> 15 Jul              Tue                     8 Office Visit   
#> 16 Jul              Fri                    14 Office Visit   
#> 17 Dec              Thu                    14 Procedure Visit
#> 18 Mar              Mon                     9 Office Visit

Original Answer:

col.tally <- function(dat, col) {
  temp <- paste0(col, "_count")
  dat %>% 
    group_by(!!col) %>% 
    mutate(!!temp := n()) %>% 
    ungroup()
}


map(names(data), ~col.tally(data, sym(.x))) %>% 
  reduce(left_join, by = names(data)) %>% 
  filter(across(contains("_count"), ~.x >= 3)) %>% 
  select(names(data))

Remove infrequent rows for the count of values in EACH column

Data:

Answers (1)

Update:

Original Answer:

Related Questions