Andrea Paterlini
Andrea Paterlini

Reputation: 41

removing groups with a certain NA number

Sorry to bother with a relatively simple question perhaps. I have this type of dataframe:

A long list of names in the column "NAME" c(a, b, c, d, e ...) , two potential classes in the column "SURNAME" c(A, B) and a third column containing values. I want to remove all NAMES for which at least in one of the SURNAME classes I have more than 2 "NA" in the VALUE column. I wanted to post an example dataset but I am struggling to format it properly

I was trying to use

  df <- df %>% 
  group_by(NAME) %>% 
  group_by(SURNAME) %>% 
  filter(!is.na(VALUE)) %>% 
  filter(length(VALUE)>=3)

it does not throw an error but I have the impression that something is wrong. Any suggestion? Many thanks

Upvotes: 2

Views: 935

Answers (2)

DanY
DanY

Reputation: 6073

Let's create a dataset to work with:

set.seed(1234)
df <- data.frame(
    name     = sample(x=letters,       size=1e3, replace=TRUE),
    surname  = sample(x=c("A", "B"),   size=1e3, replace=TRUE),
    value    = sample(x=c(1:10*10,NA), size=1e3, replace=TRUE),
    stringsAsFactors = FALSE
)

Here's how to do it with Base R:

# count NAs by name-surname combos (na.action arg is important!)
agg <- aggregate(value ~ name + surname, data=df, FUN=function(x) sum(is.na(x)), na.action=NULL)

# rename is count of NAs column
names(agg)[3] <- "number_of_na"

#add count of NAs back to original data
df <- merge(df, agg, by=c("name", "surname"))

# subset the original data
result <- df[df$number_of_na < 3, ]

Here's how to do it with data.table:

library(data.table)
dt <- as.data.table(df)

dt[ , number_of_na := sum(is.na(value)), by=.(name, surname)]
result <- dt[number_of_na < 3]

Here's how to do it with dplr/tidyverse:

library(dplyr) # or library(tidyverse)
result <- df %>% 
    group_by(name, surname) %>% 
    summarize(number_of_na = sum(is.na(value))) %>% 
    right_join(df, by=c("name", "surname")) %>% 
    filter(number_of_na < 3)

Upvotes: 2

akrun
akrun

Reputation: 886938

After grouping by 'NAME', 'SURNAME', create a column with the number of NA elements in that group and then filter out any 'NAME' that have an 'ind' greater than or equal to 3

df %>%
    group_by(NAME, SURNAME) %>%
    mutate(ind = sum(is.na(VALUE))) %>% 
    group_by(NAME) %>%
    filter(!any(ind >=3)) %>%
    select(-ind)

Or do an anti_join after doing the filtering by 'NAME', 'SURNAME' based on the condition

df %>% 
   group_by(NAME, SURNAME) %>% 
   filter(sum(is.na(VALUE))>=3) %>% 
   ungroup %>% 
   distinct(NAME)  %>%
   anti_join(df, .)

data

set.seed(24)
df <- data.frame(NAME = rep(letters[1:5], each = 20), 
                SURNAME = sample(LETTERS[1:4], 5 * 20, replace = TRUE), 
                VALUE = sample(c(NA, 1:3), 5 *20, replace = TRUE), 
                   stringsAsFactors = FALSE)

Upvotes: 3

Related Questions