Tags: r, function, dataframe, user-defined-functions, plyr

Reputation: 307

going nuts trying to write a simple function that operates on one column of a dataframe

I am trying to write a function that "variabilizes" the ddply call:

december <- ddply(adk47, .(PeakName, Elevation), summarize, 
  needThese=if(sum(dec) == 0) "needThis" 
  else character(0), .progress='text')

where the data frame has a three-letter column name for each month. I am trying to write the function as:

need.fr.month <- function(df, monthCol) {
    needThese <- ddply(df, .(PeakName, Elevation), 
                       summarize, 
                       needThese=if(sum(monthCol) == 0)
                           "needThis" else character(0)
    )
    return(needThese)
}

but when I call this with

need.fr.month(adk47, oct)

or with

need.fr.month(adk47, "oct")

I get these error messages:

Error in eval(expr, envir, enclos) : object 'monthCol' not found

Error in sum("monthCol") : invalid 'type' (character) of argument

I know that I am not getting something very basic, but I don't know what.

I am using this DF to practice writing R functions. My other functions have gone fairly well; however, this is the first function in which I am trying to variabilize a df column.

Help would be gratefully appreciated.

Here is a reproducible example for a subset of the data:

PeakName    Elevation   jul aug sep oct nov dec
Algonquin   5114    0   0   1   0   0   0
Algonquin   5114    0   0   0   0   0   0
Algonquin   5114    0   0   0   1   0   0
Algonquin   5114    1   0   0   0   0   0
Allen   4340    0   0   0   0   0   0
Allen   4340    0   0   0   0   0   0
Allen   4340    0   0   1   0   0   0
Allen   4340    1   0   0   0   0   0
Allen   4340    0   0   0   0   1   0
Armstrong   4400    0   0   0   0   0   0
Armstrong   4400    0   0   0   0   0   0
Armstrong   4400    0   0   0   0   0   0
Armstrong   4400    0   0   0   0   0   0
Armstrong   4400    0   0   0   0   1   0
Armstrong   4400    0   0   0   0   0   0
Armstrong   4400    0   0   0   1   0   0
Basin   4827    1   0   0   0   0   0
Basin   4827    0   0   0   0   0   0
Basin   4827    0   0   0   0   0   0
Basin   4827    0   0   0   0   0   0
Basin   4827    0   0   0   0   0   0
Basin   4827    0   0   0   0   0   0
Basin   4827    0   0   0   0   1   0
Big.Slide   4240    0   0   0   0   0   0
Big.Slide   4240    0   0   0   1   0   0
Big.Slide   4240    0   0   0   0   0   0
Big.Slide   4240    0   0   1   0   0   0
Big.Slide   4240    0   0   0   0   0   0
Big.Slide   4240    0   0   0   0   0   0
Big.Slide   4240    0   0   0   0   0   0
Big.Slide   4240    1   0   0   0   0   0

I hope this suffices. Clearly this is a subset of the data. The form is that each "hike" has one line with the months columns (here truncated to July thru December) indicating a "1" for one month and a zero for the other 11.

Thanks

Wayne

Upvotes: 2

Answers (4)

mnel

Reputation: 115425

I think it would be far easier to create a single column that records which of your monthly indicator variables is set for each row (as described in "Optimization: splitting dataframe into a list of dataframes, transforming data per row") and then subset from that.

I would advocate using data.table not ddply + summarize for efficiency (but perhaps this is a longer term goal)

Using data.table to access set (which will work on data.frames)

library(data.table)
adk47$monthCol <- character(nrow(adk47))
# data.table specific
# adk47 <- data.table(adk47)
# adk47[, monthCol := character(nrow(adk47))]

# find which columns are == 1
whiches <- lapply(adk47[c("jul", "aug", "sep", "oct", "nov", "dec")],
                  function(x) which(x==1))
# data.table approach would require 
#  adk47[, c("jul", "aug", "sep", "oct", "nov", "dec"), with = FALSE]

for(val in names(whiches)){ 
  set(adk47, i = whiches[[val]], j = 'monthCol', value = val)
  }

head(adk47)


       PeakName Elevation jul aug sep oct nov dec monthCol
1 Algonquin      5114   0   0   1   0   0   0      sep
2 Algonquin      5114   0   0   0   0   0   0         
3 Algonquin      5114   0   0   0   1   0   0      oct
4 Algonquin      5114   1   0   0   0   0   0      jul
5     Allen      4340   0   0   0   0   0   0         
6     Allen      4340   0   0   0   0   0   0

You can then subset using monthCol
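
As a rough sketch of that last step, here is what the subsetting could look like in base R. The small data frame `hikes` is made up for illustration (a stand-in for adk47 after monthCol has been filled in), and the variable names are assumptions, not part of the original data.

```r
# Hypothetical stand-in for adk47 after monthCol has been filled in
hikes <- data.frame(
  PeakName  = c("Algonquin", "Algonquin", "Allen", "Allen"),
  Elevation = c(5114, 5114, 4340, 4340),
  monthCol  = c("sep", "oct", "", "nov"),
  stringsAsFactors = FALSE
)

# Rows for hikes taken in October
oct_hikes <- hikes[hikes$monthCol == "oct", ]

# Peaks that have no October hike at all
all_peaks <- unique(hikes$PeakName)
need_oct  <- setdiff(all_peaks, oct_hikes$PeakName)
print(need_oct)
```

The same pattern should work for any other month by swapping the string compared against monthCol.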

Upvotes: 2

WGray

Reputation: 307

Thanks all, both of these are very useful.

I went with a modified version of Blue Magister's 2nd example:

need.fr.month <- function(df, monthCol) {
    needThese <- ddply(df, .(PeakName, Elevation),
                       function(x) sum(x[[monthCol]]))
    # keep only the peaks with no hikes in the chosen month
    subsetNeedThese <- subset(needThese, V1 == 0, select=c(PeakName, Elevation))
    return(subsetNeedThese)
}

as it returns exactly what I need and I understand what it is doing. I haven't dealt with attaching and detaching environments before, so I thank cryo111 for the example. I will need to read up on this! Likewise, Blue Magister's eval-parse does seem like an easy way for me to do something I really don't understand properly.

I appreciated Blue Magister's comment: "Passing arguments into inner functions can be difficult". I will accept, for now, that this problem goes away if you avoid calling an inner function (such as "summarize") and think about it again next time I run into a problem like this!!

Upvotes: 2

cryo111

Reputation: 4474

It seems that summarize cannot find objects from the environment that calls ddply. However, you can manually attach this environment to the search path. After the ddply call, you can detach the environment.

Here is a quick example; a similar approach should work for you as well.

test_fun=function(team_vec)
{
    attach(environment())
    tmp=ddply(baseball,
              "team",
              summarise,
              duration=(if (unique(team)%in%team_vec) max(year)-min(year) else 0)
             )
    detach(environment())
    tmp
}

test_fun(c("PIT","PHI"))

Upvotes: 2

Blue Magister

Reputation: 13363

When you call

need.fr.month(adk47, oct)

R looks for a variable named oct in your general environment, and finds nothing. Therefore it reports that it is not found.

If you call:

need.fr.month(adk47, "oct")

R attempts to use the string "oct" in place of monthCol. But taking the sum of a character string doesn't make sense, so it throws an error.
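
A minimal base R illustration of that second error, using a made-up two-row data frame `d` (an assumption for this sketch only): the string is just a column name until it is used to extract the column.

```r
# Hypothetical tiny data frame with a single month column
d <- data.frame(oct = c(0, 1))
monthCol <- "oct"

# sum() on the string itself fails, because character input cannot be summed
res <- tryCatch(sum(monthCol), error = function(e) conditionMessage(e))
print(res)

# Extracting the column by name first gives a numeric vector that can be summed
total <- sum(d[[monthCol]])
print(total)
```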

Passing arguments into inner functions can be difficult. A quick kludge is by the infamous eval-parse construct. While it gets the job done, it's generally not recommended because there are often simpler methods to do the same job.

need.fr.month <- function(df, monthCol) {
    # build the ddply call as a string, splicing in the column name, then evaluate it
    needThese <- eval(parse(text=paste0("ddply(df, .(PeakName, Elevation), 
                       summarize, 
                       needThese=if(sum(", monthCol, ") == 0)
                           'needThis' else character(0)
                 )")))
    return(needThese)
}

Here, you don't need to eval-parse to get what you want. Just don't use summarize and rely on the base R extraction functions:

need.fr.month <- function(df, monthCol) {
    needThese <- ddply(df, .(PeakName, Elevation), 
                       function(x) sum(x[[monthCol]]))
    return(needThese)
    #return(needThese[needThese[["V1"]] != 0,])
}

I think this approach could be made better, but I can't improve on it further without knowing what you want to do with the information. If you want to find the rows that you'd like to subset, I think it would be better to do something like:

need.fr.month <- function(df, monthCol) {
ave(df[[monthCol]],df[["PeakName"]],df[["Elevation"]],FUN=sum)
}
adk47$need <- need.fr.month(adk47,"dec") == 0

This then gives you a column in your data frame that will let you subset for the data you are looking for, via adk47$need == TRUE.
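
Here is a small worked sketch of the ave approach in base R. The data frame `adk` is a hypothetical miniature of adk47 with only an oct column, and `needed` is a name introduced just for this example.

```r
# Hypothetical miniature of adk47: one row per hike
adk <- data.frame(
  PeakName  = c("Algonquin", "Algonquin", "Allen", "Allen"),
  Elevation = c(5114, 5114, 4340, 4340),
  oct       = c(0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Per-group sum of the chosen month column, repeated on every row of the group
need.fr.month <- function(df, monthCol) {
  ave(df[[monthCol]], df[["PeakName"]], df[["Elevation"]], FUN = sum)
}

# TRUE for every row whose peak has no hike in the chosen month
adk$need <- need.fr.month(adk, "oct") == 0

# One row per peak that still needs an October hike
needed <- unique(adk[adk$need, c("PeakName", "Elevation")])
print(needed)
```

Because ave returns a vector the same length as the input, the flag lines up with the original rows, so no merge back onto the data frame is needed.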

Upvotes: 3
