Keeping rows from the previous year

Question

I am just going to give it a try as I know that here are some smart people who might have an r -code for this. I wont be able to code this by myself.

So I got a dataset that contains the names and years-months between 2000-01 and 2008-12. Looking like this:

 Name      Date
 A         2000-01
 A         2000-02
 A          ...
 A         2008-12
 A         2000-01
 B         2000-01
 B          ...
 B         2008-12
 C         and so on..

It can happen that for each name in my key column there is one value for each year. Thats the best I can ask for. Unfortunately some years dont have a value in my key column. Getting further in my dataset looking only at Name A:

So if I do not have 1 observations for every year between 2000-2008 and I want to get the row from the year and month that does not have a value for my key column based on the month from the year on the next observation. In this example:

2003-02 has a value for my keycolumn and 2002-02 does not, I want to get back the row from the date 2002-02 and Name A. In a nutshell: "Keeping rows from the previous year based on key column from the next year"

Is there some easy way to code this? Thank you :)

Weihuang Wong · Accepted Answer

There's no straightforward and easy way to code what you're describing, but it's certainly possible to break the problem down into easier parts. The core part of the problem is as follows. Given a dataframe of rows with non-NA values, e.g.

  year month
1 2002    12
2 2005    11
3 2006    01
4 2008    07

for each row, check the dataframe to see if the previous year exists; if yes, return the row, if no, return an additional row with the previous year and the same month. Here's what a function to do that might look like

check_ym <- function(y, m, dat) {
  if ((y - 1) %in% dat$year) {
    return(data.frame(Date = paste(y, m, sep = "-"), stringsAsFactors = FALSE))
  } else {
    return(data.frame(Date = paste(c(y - 1, y), c(m, m), sep = "-"), stringsAsFactors = FALSE))
  }
}

Now, let's make some fake data.

library(dplyr)
library(tidyr)
library(purrr)

# Simulate data
set.seed(123)
x <- data.frame(Date = paste(sample(2000:2008, 4),
                             sprintf("%02d", sample(1:12, 4, replace = TRUE)),
                             sep = "-"),
                KeyColumn = floor(runif(4, 1, 10)))
d <- data.frame(Date = paste(rep(2000:2008, each = 12),
                             sprintf("%02d", rep(1:12, times = 9)),
                             sep = "-")) %>%
  left_join(x)

Identify the non-NA rows:

dd <- d %>%
  na.omit() %>%
  separate(Date, into = c("year", "month")) %>%
  mutate(year = as.numeric(year))
dd
#   year month KeyColumn
# 1 2002    12         5
# 2 2005    11         5
# 3 2006    01         5
# 4 2008    07         9

Then, we run the function above, iterating through the year and month columns. This gives us

out <- map2_df(dd$year, dd$month, .f = check_ym, dat = dd)
out
#      Date
# 1 2001-12
# 2 2002-12
# 3 2004-11
# 4 2005-11
# 5 2006-01
# 6 2007-07
# 7 2008-07

Finally, we join this with our original data:

inner_join(out, d)
# Joining, by = "Date"
#      Date KeyColumn
# 1 2001-12        NA
# 2 2002-12         5
# 3 2004-11        NA
# 4 2005-11         5
# 5 2006-01         5
# 6 2007-07        NA
# 7 2008-07         9

This is just for one Name. We can also do this for many Names. First create some fake data:

# Simulate data
set.seed(123)
d <- map_df(setNames(1:3, LETTERS[1:3]), function(...) {
  x <- data.frame(Date = paste(sample(2000:2008, 4),
                               sprintf("%02d", sample(1:12, 4, replace = TRUE)),
                               sep = "-"),
                  KeyColumn = floor(runif(4, 1, 10)))
  data.frame(Date = paste(rep(2000:2008, each = 12),
                               sprintf("%02d", rep(1:12, times = 9)),
                               sep = "-")) %>%
    left_join(x)
}, .id = "Name")
dd <- d %>%
  na.omit() %>%
  separate(Date, into = c("year", "month")) %>%
  mutate(year = as.numeric(year))
dd
#    Name year month KeyColumn
# 1     A 2002    12         5
# 2     A 2005    11         5
# 3     A 2006    01         5
# 4     A 2008    07         9
# 5     B 2000    04         6
# 6     B 2004    01         7
# 7     B 2005    12         9
# 8     B 2006    03         9
# 9     B 2000    04         6
# 10    C 2003    12         1
# 11    C 2005    04         7
# 12    C 2006    11         5
# 13    C 2008    02         8

Now, use split to split the dataframe into three dataframes by Name; for each sub-dataframe, we apply check_ym(), and then we combine the results together and join it with the original data:

lapply(split(dd, dd$Name), function(dat) {
  map2_df(dat$year, dat$month, .f = check_ym, dat = dat)
}) %>%
  bind_rows(.id = "Name") %>%
  inner_join(d)
# Joining, by = c("Name", "Date")
#    Name    Date KeyColumn
# 1     A 2001-12        NA
# 2     A 2002-12         5
# 3     A 2004-11        NA
# 4     A 2005-11         5
# 5     A 2006-01         5
# 6     A 2007-07        NA
# 7     A 2008-07         9
# 8     B 2000-04         6
# 9     B 2003-01        NA
# 10    B 2004-01         7
# 11    B 2005-12         9
# 12    B 2006-03         9
# 13    C 2002-12        NA
# 14    C 2003-12         1
# 15    C 2004-04        NA
# 16    C 2005-04         7
# 17    C 2006-11         5
# 18    C 2007-02        NA
# 19    C 2008-02         8

Keeping rows from the previous year

Answers (1)

Related Questions