Josh Brinks
Josh Brinks

Reputation: 141

lapply - Create new variables, based on current variables, conditionally based on info in 2nd data frame

I've been working on a lot of new projects where I'm doing lots of unfamiliar data prep and management.

I have two data frames: 1) that is very large with thousands of observations and variables (df1), and 2) a data frame that lists ranges of collection years for a subset of variables in df1 (df2). I need to create a new variable in df1 for a large subset of variables/columns in df1. The new variables created for df1 will check if a value is present (1), a value isn't present for a year that was collected (0), or a value isn't present and the year falls outside of the collection range listed in df2 ('NA').

I've spent a couple days reading a ton of lapply() but I can't seem to find one that addresses my needs or deals with type of complexity such that I don't have to do this with brute force.

Here is my workable starting data frame:

grp <- c('a', 'a', 'a', 'b', 'b')
year <- c(1991, 1992, 1993, 2005, 2010)
v1 <- c(20.5, 30.5, 29.6, 28.7, 26.1)
v2 <- c(100.0, 101.5, 105.1, 'NA', 95.0)
v3 <- c(47.2, 'NA', 'NA', 'NA', 'NA')
df1 <- data.frame(grp = grp, year = year, v1 = v1, v2 = v2, v3 = v3)
df1

grp year   v1    v2   v3
a   1991  20.5   100 47.2
a   1992  30.5 101.5   NA
a   1993  29.6 105.1   NA
b   2005  28.7    NA   NA
b   2010  26.1    95   NA

Here is my reference data frame with coverages for variables in df1:

vars <- c('v1', 'v2', 'v3')
start <- c(1989, 2004, 1980)
end <- c(2015, 2011, 1994)
df2 <- data.frame(vars = vars, start = start, end = end)
df2

vars start  end
 v1  1989 2015
 v2  2004 2011
 v3  1980 1994

I've been learning with simple stuff with 'lapply()' like:

test <- df1[paste0(vars, '.cov')] <- lapply(df1[vars], function(x) as.integer(x > 0))

I wrote out in R, what I think are, the types of conditions that need to be met. I'll narrate with written English:

  1. a value is present for a year that was collected (1)

    if (!is.na(x)) { x <- 1 }
    
  2. a value isn't present for a year that falls within the range listed in df2 (0)

    if (is.na(x) & year %in% seq(df2$start[df2$vars == names(df1[x]), ], df2$end[df2$vars == names(df1[x]), ], 1)) { x <- 0 }
    
  3. a value isn't present and the year falls outside of the collection range listed in df2 ('NA')

    if (is.na(x) & !(year %in% seq(df2$start[df2$vars == names(df1[x]), ], df2$end[df2$vars == names(df1[x]), ], 1))) { x <- 'NA' }
    

I did my best with the syntax and indexing, but we're rapidly getting out of my comfort zone.

After running the conditional checks the desired output/modified df1 should look like:

grp year   v1    v2   v3 v1.cov v2.cov v3.cov
  a 1991 20.5   100 47.2      1      1      1
  a 1992 30.5 101.5   NA      1      1      0
  a 1993 29.6 105.1   NA      1      1      0
  b 2005 28.7    NA   NA      1      0     NA
  b 2010 26.1    95   NA      1      1     NA

I'm open to a variety of solutions but this seemed the likely path to move through. Thanks again for all the help. I'm an experienced R modeler/scientist but I've learned so much data prep, 'data.table', and 'dplyr' in the past month with all your help.

Upvotes: 3

Views: 259

Answers (1)

Frank
Frank

Reputation: 66819

With data.table:

library(data.table)
setDT(df1)
DT = melt(df1, id = c("grp", "year"), meas = patterns("^v"))[, value := type.convert(as.character(value))]

# mark based on whether found or not within collection periods
DT[df2, on=.(variable = vars, year >= start, year <= end), 
  found := as.integer(!is.na(value))]

# also mark if found outside collection periods
DT[!is.na(value) & is.na(found), found := 1L ]

which gives

    grp year variable value found
 1:   a 1991       v1  20.5     1
 2:   a 1992       v1  30.5     1
 3:   a 1993       v1  29.6     1
 4:   b 2005       v1  28.7     1
 5:   b 2010       v1  26.1     1
 6:   a 1991       v2 100.0     1
 7:   a 1992       v2 101.5     1
 8:   a 1993       v2 105.1     1
 9:   b 2005       v2    NA     0
10:   b 2010       v2  95.0     1
11:   a 1991       v3  47.2     1
12:   a 1992       v3    NA     0
13:   a 1993       v3    NA     0
14:   b 2005       v3    NA    NA
15:   b 2010       v3    NA    NA

(type.convert is used to override OP's encoding of missing data with string 'NA'.)


The melt step only makes sense here because the variables seem to be of the same type (numeric). If they aren't, something similar can be done by looping over each column:

setDT(df1)
setDT(df2)
for (v in unique(df2$vars)){
  df1[, (v) := type.convert(as.character(get(v)))]

  fcol = paste0("found.",v)
  df1[df2[vars == v], on=.(year >= start, year <= end), 
    (fcol) := as.integer(!is.na(get(v)))]
  df1[!is.na(get(v)) & is.na(get(fcol)), (fcol) := 1L ]
}

   grp year   v1    v2   v3 found.v1 found.v2 found.v3
1:   a 1991 20.5 100.0 47.2        1        1        1
2:   a 1992 30.5 101.5   NA        1        1        0
3:   a 1993 29.6 105.1   NA        1        1        0
4:   b 2005 28.7    NA   NA        1        0       NA
5:   b 2010 26.1  95.0   NA        1        1       NA

Upvotes: 2

Related Questions