German Balyanin
German Balyanin

Reputation: 13

How to remove subjects with missing yearly observations in R?

     num Name  year   age       X 
1      1   A   2011    68  116292
2      1   A   2012    69   46132
3      1   A   2013    70    7042
4      1   A   2014    71 -100425
5      1   A   2015    72    6493
6      2   B   2011    20   -8484
7      3   C   2015    23 -120836
8      4   D   2011     3  -26523
9      4   D   2012     4    9923
10     4   D   2013     5   82432

I have the data which is represented by various subjects in 5 years. I need to remove all the subjects, which are missing any of years from 2011 to 2015. How can I accomplish it, so in given data only subject A is left?

Upvotes: 1

Views: 1617

Answers (4)

Matt L.
Matt L.

Reputation: 2964

Here is a slightly more straightforward tidyverse solution. First, expand the dataframe to include all combinations of Name + year:

df %>% complete(Name, year)

    # A tibble: 20 x 5
     Name  year   num   age       X
   <fctr> <int> <int> <int>   <int>
 1      A  2011     1    68  116292
 2      A  2012     1    69   46132
 3      A  2013     1    70    7042
 4      A  2014     1    71 -100425
 5      A  2015     1    72    6493
 6      B  2011     2    20   -8484
 7      B  2012    NA    NA      NA
 8      B  2013    NA    NA      NA
 9      B  2014    NA    NA      NA
10      B  2015    NA    NA      NA
...

Then extend the pipe to group by "Name", and filter to keep only those with 0 NA values:

df %>% complete(Name, year) %>%
  group_by(Name) %>%
  filter(sum(is.na(age)) == 0)

# A tibble: 5 x 5
# Groups:   Name [1]
    Name  year   num   age       X
  <fctr> <int> <int> <int>   <int>
1      A  2011     1    68  116292
2      A  2012     1    69   46132
3      A  2013     1    70    7042
4      A  2014     1    71 -100425
5      A  2015     1    72    6493

Upvotes: 2

Samuel
Samuel

Reputation: 3053

Here is a somewhat different approach using tidyverse packages:

library(tidyverse)

df <- read.table(text = "     num Name  year   age       X 
1      1   A   2011    68  116292
2      1   A   2012    69   46132
3      1   A   2013    70    7042
4      1   A   2014    71 -100425
5      1   A   2015    72    6493
6      2   B   2011    20   -8484
7      3   C   2015    23 -120836
8      4   D   2011     3  -26523
9      4   D   2012     4    9923
10     4   D   2013     5   82432")

df2 <- spread(data = df, key = Name, value = year)
x <- colSums(df2[, 4:7], na.rm = TRUE) > 10000
df3 <- select(df2, num, age, X, c(4:7)[x])
df4 <- na.omit(df3)

All steps can of course be constructed as one single pipe with the %>% operator.

Upvotes: 0

Mike H.
Mike H.

Reputation: 14360

Using data.table:

A data.table solution might look something like this:

library(data.table)
dt <- as.data.table(df)
dt[, keep := identical(unique(year), 2011:2015), by = Name ][keep == T, ][,keep := NULL]
#   num Name year age       X
#1:   1    A 2011  68  116292
#2:   1    A 2012  69   46132
#3:   1    A 2013  70    7042
#4:   1    A 2014  71 -100425
#5:   1    A 2015  72    6493

This is more strict in that it requires that the unique years be exactly equal to 2011:2015. If there is a 2016, for example that person would be excluded.


A less restrictive solution would be to check that 2011:2015 is in your unique years. This should work:

dt[, keep := all(2011:2015 %in% unique(year)), by = Name ][keep == T, ][,keep := NULL]

Thus, if for example, A had a 2016 year and a 2010 year it would still keep all of A. But if anyone is missing a year in 2011:2015 this would exclude them.


Using base R & aggregate:

Same option, but using aggregate from base R:

agg <- aggregate(df$year, by = list(df$Name), FUN = function(x) all(2011:2015 %in% unique(x)))
df[df$Name %in% agg[agg$x == T, 1] ,]

Upvotes: 2

G5W
G5W

Reputation: 37661

Just check which names have the right number of entries.

## Reproduce your data
df = read.table(text="     num Name  year   age       X 
1      1   A   2011    68  116292
2      1   A   2012    69   46132
3      1   A   2013    70    7042
4      1   A   2014    71 -100425
5      1   A   2015    72    6493
6      2   B   2011    20   -8484
7      3   C   2015    23 -120836
8      4   D   2011     3  -26523
9      4   D   2012     4    9923
10     4   D   2013     5   82432",
header=TRUE)

Tab = table(df$Name)
Keepers = names(Tab)[which(Tab == 5)]
df[df$Name %in% Keepers,]
  num Name year age       X
1   1    A 2011  68  116292
2   1    A 2012  69   46132
3   1    A 2013  70    7042
4   1    A 2014  71 -100425
5   1    A 2015  72    6493

Upvotes: 1

Related Questions