Reputation: 307
I am trying to write a function that "variabilizes" the ddply call:
december <- ddply(adk47, .(PeakName, Elevation), summarize,
needThese=if(sum(dec) == 0) "needThis"
else character(0), .progress='text')
where the df has a three-letter column for each month. I am trying to write the function as:
need.fr.month <- function(df, monthCol) {
needThese <- ddply(df, .(PeakName, Elevation),
summarize,
needThese=if(sum(monthCol) == 0)
"needThis" else character(0)
)
return(needThese)
}
but when I call this with
need.fr.month(adk47, oct)
or with
need.fr.month(adk47, "oct")
I get these error messages:
Error in eval(expr, envir, enclos) : object 'monthCol' not found
or
Error in sum("monthCol") : invalid 'type' (character) of argument
I know that I am not getting something very basic, but I don't know what.
I am using this DF to practice writing R functions. My other functions have gone fairly well; however, this is the first function in which I am trying to variabilize a df column.
Help would be gratefully appreciated.
Here is a reproducible example for a subset of the data:
PeakName Elevation jul aug sep oct nov dec
Algonquin 5114 0 0 1 0 0 0
Algonquin 5114 0 0 0 0 0 0
Algonquin 5114 0 0 0 1 0 0
Algonquin 5114 1 0 0 0 0 0
Allen 4340 0 0 0 0 0 0
Allen 4340 0 0 0 0 0 0
Allen 4340 0 0 1 0 0 0
Allen 4340 1 0 0 0 0 0
Allen 4340 0 0 0 0 1 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 0 1 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 1 0 0
Basin 4827 1 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 1 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 0 0 0 1 0 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 0 0 1 0 0 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 1 0 0 0 0 0
I hope this suffices. Clearly this is a subset of the data. The form is that each "hike" has one line with the months columns (here truncated to July thru December) indicating a "1" for one month and a zero for the other 11.
Thanks
Wayne
Upvotes: 2
Views: 366
Reputation: 115425
I think it would be far easier to create a single column that records which of your indicator variables is set (as described in "Optimization: splitting dataframe into a list of dataframes, transforming data per row") and then subset from that.
I would advocate using data.table, not ddply + summarize, for efficiency (but perhaps this is a longer-term goal).
Using data.table to access set (which will work on data.frames):
library(data.table)
adk47$monthCol <- character(nrow(adk47))
# data.table specific
# adk47 <- data.table(adk47)
# adk47[, monthCol := character(nrow(adk47))]
# find which columns are == 1
whiches <- lapply(adk47[c("jul", "aug", "sep", "oct", "nov", "dec")],
function(x) which(x==1))
# data.table approach would require
# adk47[, c("jul", "aug", "sep", "oct", "nov", "dec"), with = FALSE]
for(val in names(whiches)){
set(adk47, i = whiches[[val]], j = 'monthCol', value = val)
}
head(adk47)
PeakName Elevation jul aug sep oct nov dec monthCol
1 Algonquin 5114 0 0 1 0 0 0 sep
2 Algonquin 5114 0 0 0 0 0 0
3 Algonquin 5114 0 0 0 1 0 0 oct
4 Algonquin 5114 1 0 0 0 0 0 jul
5 Allen 4340 0 0 0 0 0 0
6 Allen 4340 0 0 0 0 0 0
You can then subset using monthCol
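As a rough illustration of that last step, here is a minimal base R sketch. The data frame hikes and the variables all_peaks, peaks_done_oct and peaks_needed_oct are made-up names for this example, not part of the original data; the sketch assumes a monthCol column has already been filled in as above.

```r
# Hypothetical miniature of adk47 after monthCol has been filled in
hikes <- data.frame(PeakName = c("Algonquin", "Algonquin", "Allen"),
                    monthCol = c("sep", "oct", "sep"),
                    stringsAsFactors = FALSE)

# Every peak in the data
all_peaks <- unique(hikes$PeakName)

# Peaks that already have at least one October hike
peaks_done_oct <- unique(hikes$PeakName[hikes$monthCol == "oct"])

# Peaks still needed for October: everything not in the "done" set
peaks_needed_oct <- setdiff(all_peaks, peaks_done_oct)
```

Because the month is now a value rather than a column name, the same lines work for any month by changing the string "oct".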
Upvotes: 2
Reputation: 307
Thanks all, both of these are very useful.
I went with a modified version of Blue Magister's 2nd example:
need.fr.month <- function(df, monthCol) {
needThese <- ddply(df, .(PeakName, Elevation),
function(x) sum(x[[monthCol]]))
subsetNeedThese <- subset(needThese, V1 == 0, select=c(PeakName, Elevation))
}
as it returns exactly what I need and I understand what it is doing. I haven't dealt with attaching and detaching environments before, so I thank croy111 for the example. I will need to read up on this! Likewise, Blue Magister's eval-parse does seem like an easy way for me to do something I really don't understand properly.
I appreciated Blue Magister's comment: "Passing arguments into inner functions can be difficult". I will accept, for now, that this problem goes away if you avoid calling an inner function (such as "summarize") and think about it again next time I run into a problem like this!!
Upvotes: 2
Reputation: 4474
It seems that summarize cannot find objects from the environment that calls ddply. However, you can manually attach this environment to the search path. After the ddply call, you can detach the environment.
Here is a quick example; a similar approach should work for you as well.
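library(plyr)  # provides ddply, summarise and the baseball data set

test_fun=function(team_vec)
{
  # put this function's environment on the search path so summarise can see team_vec
  attach(environment())
  tmp=ddply(baseball,
            "team",
            summarise,
            duration=(if (unique(team)%in%team_vec) max(year)-min(year) else 0)
  )
  # remove it again once ddply has finished
  detach(environment())
  tmp
}
test_fun(c("PIT","PHI"))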
Upvotes: 2
Reputation: 13363
When you call
need.fr.month(adk47, oct)
R looks for a variable named oct in your general environment, and finds nothing. Therefore it reports that it is not found.
If you call:
need.fr.month(adk47, "oct")
R attempts to use the string "oct" in place of monthCol. But taking the sum of a character string doesn't make sense, so it throws an error.
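To make the second failure concrete, here is a small base R sketch. The data frame toy and the helper col_sum are hypothetical names used only for illustration. It shows that sum() cannot work on a column name held as a string, while [[ ]] turns that string into the actual column.

```r
# Toy data: a hypothetical stand-in for adk47
toy <- data.frame(PeakName = c("A", "A", "B"),
                  oct = c(0, 1, 0),
                  dec = c(0, 0, 0))

# Summing the name itself fails: "oct" is only a character string
res <- try(sum("oct"), silent = TRUE)
failed <- inherits(res, "try-error")

# Looking the column up by name yields numbers that can be summed
col_sum <- function(df, monthCol) sum(df[[monthCol]])
oct_total <- col_sum(toy, "oct")
dec_total <- col_sum(toy, "dec")
```

The [[ ]] lookup is evaluated inside the function, where monthCol is an ordinary local variable, so no special scoping is involved.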
Passing arguments into inner functions can be difficult. A quick kludge is the infamous eval-parse construct. While it gets the job done, it's generally not recommended because there are often simpler ways to do the same thing.
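need.fr.month <- function(df, monthCol) {
  # build the ddply call as text, splicing in the column name, then parse and run it
  # (single quotes inside the string so they do not end the double-quoted text)
  needThese <- eval(parse(text=paste0("ddply(df, .(PeakName, Elevation),
    summarize,
    needThese=if(sum(", monthCol, ") == 0)
    'needThis' else character(0)
    )")))
  return(needThese)
}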
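For readers unfamiliar with the construct, the following base R sketch shows what eval-parse does on a much smaller expression. The names code_text, expr, toy and result are invented for this illustration, and plyr is not involved.

```r
monthCol <- "oct"

# Splice the column name into source text
code_text <- paste0("sum(", monthCol, ") == 0")

# parse() turns the text into an unevaluated R expression
expr <- parse(text = code_text)[[1]]

# eval() runs the expression, here looking up oct inside the data frame
toy <- data.frame(oct = c(0, 0, 1))
result <- eval(expr, toy)
```

The column name ends up as a bare symbol in the generated code, which is why the kludge works and also why it is fragile: any string that is not a valid column name produces code that fails or does something unintended.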
Here, you don't need to eval-parse to get what you want. Just don't use summarize and rely on the base R extraction functions:
need.fr.month <- function(df, monthCol) {
needThese <- ddply(df, .(PeakName, Elevation),
function(x) sum(x[[monthCol]]))
return(needThese)
#return(needThese[needThese[["V1"]] != 0,])
}
I think this approach could be made better, but I can't improve on it further without knowing what you want to do with the information. If you want to find the rows that you'd like to subset, I think it would be better to do something like:
need.fr.month <- function(df, monthCol) {
ave(df[[monthCol]],df[["PeakName"]],df[["Elevation"]],FUN=sum)
}
adk47$need <- need.fr.month(adk47,"dec") == 0
This then gives you a column in your data frame that will let you subset for the data you are looking for, via adk47$need == TRUE.
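A minimal sketch of that subsetting step, using a made-up four-row data frame (toy and needed are hypothetical names) in place of adk47:

```r
# Hypothetical miniature of adk47 with a single month column
toy <- data.frame(PeakName  = c("Algonquin", "Algonquin", "Allen", "Allen"),
                  Elevation = c(5114, 5114, 4340, 4340),
                  oct       = c(0, 1, 0, 0),
                  stringsAsFactors = FALSE)

# Same function as above: per-peak total, repeated on every row of that peak
need.fr.month <- function(df, monthCol) {
  ave(df[[monthCol]], df[["PeakName"]], df[["Elevation"]], FUN = sum)
}

# TRUE on every row of a peak that has no hike in the chosen month
toy$need <- need.fr.month(toy, "oct") == 0

# One row per peak that is still needed
needed <- unique(toy[toy$need, c("PeakName", "Elevation")])
```

Since ave returns one value per input row, unique is used here to collapse the result to one row per peak.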
Upvotes: 3