apply with subset function (or custom function based on subset)

Question

I am trying to find a way to use an apply function together with subset (or a custom function based on subset). I know similar questions have already been asked, but mine is a little more specific. I need to subset certain parts of multiple data sets based on more than one variable. I have a couple of "types" of data frame structures; one of them looks similar to this:

colour  shade   value
RED LIGHT   -1.05
RED LIGHT   -1.37
RED LIGHT   -0.32
RED LIGHT   0.87
RED LIGHT   -0.2
RED DARK    0.52
RED DARK    -0.2
RED DARK    0.64
RED DARK    1.12
RED DARK    4
BLUE    LIGHT   0.93
BLUE    LIGHT   0.78
BLUE    LIGHT   -1.84
BLUE    LIGHT   -0.5
BLUE    LIGHT   -1.11
BLUE    DARK    -4.86
BLUE    DARK    1.11
BLUE    DARK    0.14
BLUE    DARK    0.12
BLUE    DARK    -1.65
GREEN   LIGHT   3.13
GREEN   LIGHT   2.65
GREEN   LIGHT   -2.36
GREEN   LIGHT   -3.11
GREEN   LIGHT   3.49
GREEN   DARK    1.91
GREEN   DARK    -1.1
GREEN   DARK    -1.93
GREEN   DARK    1
GREEN   DARK    -0.23

I have a lot of those. Their names are stored in

list.dfs.names=df1,df2,df3

Based on this I need to use subset or a custom function based on it:

customSubset=function(df,col,shade){subset(df,df$colour %in% col & df$shade %in% shade)}

I use custom functions like this because, as I said, I have a couple of types of df structures and it speeds up my work a little. It works like this:

example=customSubset(df1,"BLUE","DARK")

and output is:

   colour shade value
11   BLUE LIGHT  0.93
12   BLUE LIGHT  0.78
13   BLUE LIGHT -1.84
14   BLUE LIGHT -0.50
15   BLUE LIGHT -1.11
16   BLUE  DARK -4.86
17   BLUE  DARK  1.11
18   BLUE  DARK  0.14
19   BLUE  DARK  0.12
20   BLUE  DARK -1.65

Until now I have been using for loops, but I want to switch to apply, which seems more convenient, especially where nested loops are required. So I tried:

lapply(customSubset(list.dfs.names, "BLUE","DARK") )

and

lapply(list.dfs.names, customSubset("BLUE","DARK") )

with no success. Could anyone give me a hand with this issue? I don't think I clearly understand how apply loops work. However, I am quite familiar with the for approach, so any additional explanation of the differences would be appreciated.

If it is not possible with customSubset, it's OK for me to use regular subset or any other method that produces the same result as the example presented above.

Thank you in advance

EDIT: here is code to produce a df similar to the example I posted:

`data.frame("colour"=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
           ,"shade"=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
           ,"value"=runif(30,min=0,max=1))`

EDIT2: As requested, I am editing my post to expand on my year problem. My dfs come from different years (several from each), for example df.1.2012, df.2.2012, df.1.2011 and so on. The main issue is that I never need to refer to the same year in all dfs (that would be very easy); instead I need to subset the data based on a certain horizon (for example year+2 or year-1). I used to create a list of the desired years (with year+2, for example, it would be list.year=c(2014,2014,2013)), which was paired with the list of my dfs (that is how it worked with the for loop).

I need to find a similar method for the apply approach. Here is an example:

set.seed(200)

 df_2014=data.frame(colour=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
           ,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
           ,year=c(rep(2011:2015,6))
           ,value=runif(30,min=0,max=1))

 df_2013=data.frame(colour=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
           ,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
           ,year=c(rep(2011:2015,6))
           ,value=runif(30,min=0,max=1))
horizon=+1

subset(df_2014, df_2014$colour %in% "BLUE" & df_2014$shade %in% "DARK" & df_2014$year %in% c(2014+horizon))
subset(df_2013, df_2013$colour %in% "BLUE" & df_2013$shade %in% "DARK" & df_2013$year %in% c(2013+horizon))

So I added a column with years, called it year, and named the dfs after their year (so year+1 here would be 2014+1). The horizon is self-explanatory. The result is:

#df_2014
      colour shade year   value
 20   BLUE  DARK 2015 0.6463296

#df_2013

   colour shade year     value
20   BLUE  DARK 2015 0.6532767

I need to apply a function to a list of data frames (in this edit list.df=list(df_2014,df_2013)) as in the previous example, but this time add the subset condition year+horizon (and possibly put all results in one df, but this is not the main issue here).

In conclusion: in the year+horizon part of both subset calls above, year has to change depending on which df (from the list) the loop is currently processing, while horizon stays constant.

If you have trouble understanding what I mean please let me know, I tried to be very specific.

Rui Barradas · Accepted Answer

The problem seems to be the construct

subset(df,df$colour %in% col & df$shade %in% shade)

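subset evaluates the logical expression with the columns of its first argument, df, in scope, so a bare name that matches a column refers to that column. Because the function argument shade has the same name as the column shade, df$shade %in% shade is equivalent to df$shade %in% df$shade, which is TRUE for every row. Only the colour condition actually filters, which is why the LIGHT rows appear in your output. Rewrite the function so that the argument names differ from the column names: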

customSubset <- function(DF, COL, SHADE){
    subset(DF, colour %in% COL & shade %in% SHADE)
}
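
To see the name clash in isolation, here is a minimal sketch. The data frame `toy` and the function names `masked` and `fixed` are made up for illustration; `masked` repeats the pattern from the question and `fixed` the pattern above.

```r
# Hypothetical three-row data frame, just large enough to show the effect.
toy <- data.frame(colour = c("BLUE", "BLUE", "RED"),
                  shade  = c("LIGHT", "DARK", "DARK"))

# Argument named like a column: inside subset(), `shade` resolves to the
# column toy$shade, so the shade test compares the column with itself.
masked <- function(df, col, shade) {
  subset(df, df$colour %in% col & df$shade %in% shade)
}

# Argument names that match no column are looked up in the function instead.
fixed <- function(DF, COL, SHADE) {
  subset(DF, colour %in% COL & shade %in% SHADE)
}

nrow(masked(toy, "BLUE", "DARK"))  # both BLUE rows survive, LIGHT included
nrow(fixed(toy, "BLUE", "DARK"))   # only the BLUE/DARK row survives
```

Note that `col` did not cause trouble in the original function only because the column is called colour, not col.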

Now everything works as expected.

set.seed(5601)    # make the results reproducible

df1 <- data.frame(colour = sample(c("RED", "GREEN", "BLUE"), 30, TRUE),
                  shade = sample(c("LIGHT", "DARK"), 30, TRUE),
                  value = rnorm(30, sd = 9))
df2 <- data.frame(colour = c(rep("RED",10), rep("BLUE",10), rep("GREEN",10))
           ,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)), 3))
           , value = runif(30,min=0,max=1))

list.dfs <- list(df1, df2)

customSubset(df1,"BLUE","DARK")
#   colour shade      value
#5    BLUE  DARK   4.288107
#6    BLUE  DARK   2.860724
#8    BLUE  DARK -10.720379
#10   BLUE  DARK -15.407090
#14   BLUE  DARK  -2.259848
#30   BLUE  DARK -18.364494

# apply the function to all df's in the list
# both forms are equivalent
lapply(list.dfs, function(x) customSubset(x, "BLUE", "DARK"))
lapply(list.dfs, customSubset, "BLUE", "DARK")
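
For the year + horizon case in EDIT2, one possible approach (a sketch, not the only way) is Map, which walks two or more lists in parallel much like a for loop over a shared index. The helper `customSubsetYear`, the generator `make_df` and the toy values below are assumptions for illustration; `list.df`, `list.year` and `horizon` follow the names used in the question.

```r
# Toy data with the same structure as in the question (values are arbitrary).
make_df <- function() {
  data.frame(colour = rep(c("RED", "BLUE", "GREEN"), each = 10),
             shade  = rep(rep(c("LIGHT", "DARK"), each = 5), times = 3),
             year   = rep(2011:2015, times = 6),
             value  = seq_len(30))
}
list.df   <- list(df_2014 = make_df(), df_2013 = make_df())
list.year <- c(2014, 2013)  # base year of each data frame, same order as list.df
horizon   <- 1

# Hypothetical helper: customSubset extended with a year condition.
customSubsetYear <- function(DF, COL, SHADE, YEAR) {
  subset(DF, colour %in% COL & shade %in% SHADE & year %in% YEAR)
}

# for-loop version: the result list is allocated and filled by hand.
res_for <- vector("list", length(list.df))
names(res_for) <- names(list.df)
for (i in seq_along(list.df)) {
  res_for[[i]] <- customSubsetYear(list.df[[i]], "BLUE", "DARK",
                                   list.year[i] + horizon)
}

# Map version: pairs each data frame with its year and collects the results.
res_map <- Map(function(d, y) customSubsetYear(d, "BLUE", "DARK", y + horizon),
               list.df, list.year)

# Optionally stack the pieces into a single data frame.
combined <- do.call(rbind, res_map)
```

The main difference from the for loop is bookkeeping: Map (like lapply) creates the result list and fills it for you, so the body only has to describe what happens to one element. The loop logic itself is the same.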
