Reputation: 333
I am trying to find a way to use an `apply` function along with `subset` (or a custom function based on `subset`). I know similar questions have already been asked, but mine is a little more specific. I need to subset certain parts of multiple data sets based on more than one variable. I have a couple of "types" of data frame structures; one of them looks similar to this:
colour shade value
RED LIGHT -1.05
RED LIGHT -1.37
RED LIGHT -0.32
RED LIGHT 0.87
RED LIGHT -0.2
RED DARK 0.52
RED DARK -0.2
RED DARK 0.64
RED DARK 1.12
RED DARK 4
BLUE LIGHT 0.93
BLUE LIGHT 0.78
BLUE LIGHT -1.84
BLUE LIGHT -0.5
BLUE LIGHT -1.11
BLUE DARK -4.86
BLUE DARK 1.11
BLUE DARK 0.14
BLUE DARK 0.12
BLUE DARK -1.65
GREEN LIGHT 3.13
GREEN LIGHT 2.65
GREEN LIGHT -2.36
GREEN LIGHT -3.11
GREEN LIGHT 3.49
GREEN DARK 1.91
GREEN DARK -1.1
GREEN DARK -1.93
GREEN DARK 1
GREEN DARK -0.23
I have a lot of those. Their names are stored in
list.dfs.names=df1,df2,df3
Based on this I need to use `subset` or a custom function based on it:
customSubset=function(df,col,shade){subset(df,df$colour %in% col & df$shade %in% shade)}
I use custom functions like this because, as I said, I have a couple of types of df structures and it speeds up my work a little. It works like this:
example=customSubset(df1,"BLUE","DARK")
and output is:
colour shade value
11 BLUE LIGHT 0.93
12 BLUE LIGHT 0.78
13 BLUE LIGHT -1.84
14 BLUE LIGHT -0.50
15 BLUE LIGHT -1.11
16 BLUE DARK -4.86
17 BLUE DARK 1.11
18 BLUE DARK 0.14
19 BLUE DARK 0.12
20 BLUE DARK -1.65
Until now I was using `for` loops, but I want to change my approach to `apply`, which seems more convenient, especially where nested loops are required. So I tried:
lapply(customSubset(list.dfs.names, "BLUE","DARK") )
and
lapply(list.dfs.names, customSubset("BLUE","DARK") )
with no success. Could anyone give me a hand with this issue? I don't think I clearly understand how `apply` loops work. However, I am quite familiar with the `for` method, so any additional explanation of the differences would be appreciated.
If it is not possible with `customSubset`, it's OK for me to use regular `subset` or any other method that produces the same result as the `example` presented above.
Thank you in advance
EDIT: here is code to produce a df similar to the example I posted:
`data.frame("colour"=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
,"shade"=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
,"value"=runif(30,min=0,max=1))`
EDIT2: As requested, I am expanding my post to cover my `year` problem. My dfs come from different years (several from each), for example `df.1.2012`, `df.2.2012`, `df.1.2011` and so on. The main issue is that I never need to refer to the same year in all of the dfs (that would be very easy); instead I need to subset data based on a certain horizon (for example `year+2` or `year-1`). I used to create a list of the desired years (with `year+2` it would be `list.year=c(2014,2014,2013)`), which was paired with the list of my dfs (that is how it worked with the `for` loop).
I need to find a similar method for the `apply` approach. Here is an example:
set.seed(200)
df_2014=data.frame(colour=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
,year=c(rep(2011:2015,6))
,value=runif(30,min=0,max=1))
df_2013=data.frame(colour=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
,year=c(rep(2011:2015,6))
,value=runif(30,min=0,max=1))
horizon=+1
subset(df_2014, df_2014$colour %in% "BLUE" & df_2014$shade %in% "DARK" & df_2014$year %in% c(2014+horizon))
subset(df_2013, df_2013$colour %in% "BLUE" & df_2013$shade %in% "DARK" & df_2013$year %in% c(2013+horizon))
So I added a column with years, called it `year`, and named the dfs after the year (so `year+1` would be `2014+1` here). The horizon is self-explanatory. The result is:
#df_2014
colour shade year value
20 BLUE DARK 2015 0.6463296
#df_2013
colour shade year value
20 BLUE DARK 2015 0.6532767
I need to apply a function to a list of data frames (in this edit `list.df=list(df_2014,df_2013)`) as in the previous example, but this time add the subset condition `year+horizon` (and possibly put all results in one df, but this is not the main issue here).
In conclusion: when you look at both of my `subset` calls, in the `year+horizon` part, `year` has to change based on which df (from the list) the loop refers to, while `horizon` is constant.
If you have trouble understanding what I mean please let me know, I tried to be very specific.
Upvotes: 1
Views: 444
Reputation: 76432
The problem seems to be the construct
subset(df,df$colour %in% col & df$shade %in% shade)
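`subset` evaluates the logical expression with the columns of its first argument, `df`, visible as variables, so inside the condition the name `shade` refers to the column rather than to the function argument. `df$shade %in% shade` is therefore equivalent to `shade %in% shade`, which is always `TRUE`, and the shade filter is silently ignored. (`col` is not affected because the column is called `colour`.) Rewriting the function with argument names that differ from the column names does the trick: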
customSubset <- function(DF, COL, SHADE){
  subset(DF, colour %in% COL & shade %in% SHADE)
}
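To see the masking concretely, here is a small sketch on a three-row data frame. The name `maskedSubset` and the toy data are made up for illustration; the function body is the one from the question.

```r
# Toy data frame with the same column names as in the question
df <- data.frame(colour = c("BLUE", "BLUE", "RED"),
                 shade  = c("LIGHT", "DARK", "DARK"),
                 value  = c(1, 2, 3))

# Version from the question (hypothetical name): inside subset(),
# `shade` resolves to the column, not to the function argument
maskedSubset <- function(df, col, shade) {
  subset(df, df$colour %in% col & df$shade %in% shade)
}

# Rewritten version: argument names differ from the column names
customSubset <- function(DF, COL, SHADE) {
  subset(DF, colour %in% COL & shade %in% SHADE)
}

masked <- maskedSubset(df, "BLUE", "DARK")  # shade condition is always TRUE
fixed  <- customSubset(df, "BLUE", "DARK")  # both conditions are applied

nrow(masked)  # 2: both BLUE rows, LIGHT and DARK
nrow(fixed)   # 1: only the BLUE DARK row
```

This matches the output in the question, where asking for BLUE/DARK returned every BLUE row.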
Now everything works as expected.
set.seed(5601) # make the results reproducible
df1 <- data.frame(colour = sample(c("RED", "GREEN", "BLUE"), 30, TRUE),
shade = sample(c("LIGHT", "DARK"), 30, TRUE),
value = rnorm(30, sd = 9))
df2 <- data.frame(colour = c(rep("RED",10), rep("BLUE",10), rep("GREEN",10))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)), 3))
, value = runif(30,min=0,max=1))
list.dfs <- list(df1, df2)
customSubset(df1,"BLUE","DARK")
# colour shade value
#5 BLUE DARK 4.288107
#6 BLUE DARK 2.860724
#8 BLUE DARK -10.720379
#10 BLUE DARK -15.407090
#14 BLUE DARK -2.259848
#30 BLUE DARK -18.364494
# apply the function to all df's in the list
# both forms are equivalent
lapply(list.dfs, function(x) customSubset(x, "BLUE", "DARK"))
lapply(list.dfs, customSubset, "BLUE", "DARK")
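For the `year+horizon` case from EDIT2, one possible approach is `Map`, which walks over several vectors or lists in parallel, so each data frame can be paired with its own base year. This is only a sketch under assumptions: `yearSubset`, `make_df` and `base.years` are names made up here, and the data is generated the same way as in the question.

```r
set.seed(200)

# Hypothetical helper that builds one data frame shaped like those in EDIT2
make_df <- function() {
  data.frame(colour = rep(c("RED", "BLUE", "GREEN"), each = 10),
             shade  = rep(rep(c("LIGHT", "DARK"), each = 5), 3),
             year   = rep(2011:2015, 6),
             value  = runif(30, min = 0, max = 1))
}

list.df    <- list(df_2014 = make_df(), df_2013 = make_df())
base.years <- c(2014, 2013)  # assumed: one base year per data frame, same order
horizon    <- 1

# Argument names differ from the column names to avoid the masking problem
yearSubset <- function(DF, YEAR, COL, SHADE, HORIZON) {
  subset(DF, colour %in% COL & shade %in% SHADE & year %in% (YEAR + HORIZON))
}

# Map() calls the function once per (data frame, base year) pair
res <- Map(function(d, y) yearSubset(d, y, "BLUE", "DARK", horizon),
           list.df, base.years)

# The same thing written as a for loop, for comparison
res.loop <- vector("list", length(list.df))
for (i in seq_along(list.df)) {
  res.loop[[i]] <- yearSubset(list.df[[i]], base.years[i], "BLUE", "DARK", horizon)
}

# Optionally stack all results into one data frame
all.res <- do.call(rbind, res)
```

The difference from the `for` loop is mostly bookkeeping: `lapply` and `Map` create the result list and fill it for you, and `Map` also keeps the names of `list.df`, so you can tell which row came from which data frame.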
Upvotes: 2