Connor Joseph
Connor Joseph

Reputation: 51

How to remove all observations for which there is no observation in the current year in R?

     num Name  year   X         Y 
1      1   A   2015    68      80%
2      1   A   2016    69      85%
3      1   A   2017    70      95%
4      1   A   2018    71      85%
5      1   A   2019    72      90%
6      2   B   2018    20      80%
7      2   B   2019    23      75%
8      2   C   2014     3      55%
9      4   D   2012     4      75%
10     4   D   2013     5      100%

Let's say I have data like the above. I want to remove the observations that do not have any observations in the most recent year. So, in the above, we would be left with A & B, but C & D would be deleted. The most recent season will always in the data and can be referenced with the max() function (i.e., we don't need to hardcode as 2019 and update it yearly).

The plan is to create a facet wrapped line chart where the percentages are on the y-axis and the years are on the x-axis. The facet would be on the names so each individual will have its own line chart with their percentages by year. We don't care about people who left, so that's why we're dropping records. Though, there is a chance they come back, so I don't want to drop them from the underlying data.

Upvotes: 2

Views: 253

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269824

Using the data frame DF shown in the Note at the end we use semi_join to reduce it to the required names, convert Y to numeric and plot it. DF is not modified.

A possible alternative to the semi_join line is

filter(ave(year == max(year), Name, FUN = any)) %>%

The code is--

library(dplyr)
library(ggplot2)

DF %>%
  semi_join(filter(., year == max(year)), by = "Name") %>%
  mutate(Y = as.numeric(sub("%", "", Y))) %>% 
  ggplot(aes(year, Y)) + geom_line() + facet_wrap(~Name)

screenshot

Note

The input in reproducible form:

Lines <- "     num Name  year   X         Y
1      1   A   2015    68      80%
2      1   A   2016    69      85%
3      1   A   2017    70      95%
4      1   A   2018    71      85%
5      1   A   2019    72      90%
6      2   B   2018    20      80%
7      2   B   2019    23      75%
8      2   C   2014     3      55%
9      4   D   2012     4      75%
10     4   D   2013     5      100%"
DF <- read.table(text = Lines)

Upvotes: 2

akrun
akrun

Reputation: 887391

W can use subset from base R as well by subsetting the 'Name' where 'year' is the max, get the unique elements and create a logical vector with %in% to subset the rows

subset(df1, Name %in% unique(Name[year == max(year)]))
# num Name year  X   Y
#1   1    A 2015 68 80%
#2   1    A 2016 69 85%
#3   1    A 2017 70 95%
#4   1    A 2018 71 85%
#5   1    A 2019 72 90%
#6   2    B 2018 20 80%
#7   2    B 2019 23 75%

No packages are used


Or the similar syntax in dplyr

library(dplyr)
df1 %>% 
     filter(Name %in% unique(Name[year == max(year)]))

data

df1 <- structure(list(num = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 4L, 4L
), Name = c("A", "A", "A", "A", "A", "B", "B", "C", "D", "D"), 
    year = c(2015L, 2016L, 2017L, 2018L, 2019L, 2018L, 2019L, 
    2014L, 2012L, 2013L), X = c(68L, 69L, 70L, 71L, 72L, 20L, 
    23L, 3L, 4L, 5L), Y = c("80%", "85%", "95%", "85%", "90%", 
    "80%", "75%", "55%", "75%", "100%")), class = "data.frame", 
    row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))

Upvotes: 2

tmfmnk
tmfmnk

Reputation: 39868

One dplyr option could be:

df %>%
 group_by(Name) %>%
 filter(any(year %in% max(df$year)))

    num Name   year     X Y    
  <int> <chr> <int> <int> <chr>
1     1 A      2015    68 80%  
2     1 A      2016    69 85%  
3     1 A      2017    70 95%  
4     1 A      2018    71 85%  
5     1 A      2019    72 90%  
6     2 B      2018    20 80%  
7     2 B      2019    23 75%  

Upvotes: 3

Related Questions