GabyLP
GabyLP

Reputation: 3781

R-ddply function over loop variable

I need to loop over a data frame and calculate functions over the variable that is being looped.

A table example:

    table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20), 
    char1=c(rep('a',10), rep('b',10)), 
    target=c(rep(1,10), rep(0,10)))

I create a list of variables:

nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']

And the table that I will populate:

planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
planF<-planF[-1,]

And the loop:

library(plyr)

for (i in 1:length(nums)){ 
table$deciles<-ntile(table[,nums[i]],5)
plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
        max=max(nums[i]),pos=sum(target))
planF<-rbind(planF,plan)
}

I need to get the min and max of the variable por each decile. But instead I get:

   deciles  min  max pos
1        1 num1 num1   4
2        2 num2 num2   4
3        3 <NA> <NA>   2
4        4 <NA> <NA>   0
5        5 <NA> <NA>   0
6        1 num1 num1   4
7        2 num2 num2   4
8        3 <NA> <NA>   2
9        4 <NA> <NA>   0
10       5 <NA> <NA>   0

For variable num1 I need to get the result of:

ddply(table, 'deciles', summarize, min=min(num1),
        max=max(num1),pos=sum(target))


  deciles      min       max pos
       1 5.736842  7.157895   0
       2 7.631579  9.052632   0
       3 1.000000 10.000000   2
       4 1.947368  3.368421   4
       5 3.842105  5.263158   4

And below the result of doing the same with num2.

I understand that I need to introduce the variable with the following form:

num1

but the code is writing

'num1'

I tried with:

min=min(as.name(nums[i]))

But I get an error:

Error in min(as.name(nums[i])) : 'type' (symbol) not valid argument

how can I calculate a function over the variable that is being looped?

Upvotes: 0

Views: 708

Answers (2)

rawr
rawr

Reputation: 20811

The gist of your question is to apply a list of functions over the split-apply-combine method, so here is one way you can do this in base r.

## your data
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20), 
                  char1=c(rep('a',10), rep('b',10)), 
                  target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
table$deciles <- ntile(table[, nums[1]], 5)

FUNS <- list(min = min, max = max, mean = mean)

## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
  setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))

## combine results
do.call('rbind', x)
#        min       max     mean
# 1 1.000000  2.421053 1.710526
# 2 2.894737  4.315789 3.605263
# 3 4.789474  6.210526 5.500000
# 4 6.684211  8.105263 7.394737
# 5 8.578947 10.000000 9.289474

Instead of using a loop, since we have the above which works and is fairly simple, put it into a function like below

f <- function(num, data = table) {
  FUNS <- list(min = min, max = max, mean = mean)

  x <- tapply(data[, num], data[, 'deciles'], function(x)
    setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))

  cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}

This way we have the method generalized so it can use any column you have with any data you have. You can call it for individual columns like

f('num1')
f('num2')

Or use a loop to get everything at once

lapply(c('num1','num2'), f)

# [[1]]
#   deciles      min       max     mean
# 1       1 1.000000  2.421053 1.710526
# 2       2 2.894737  4.315789 3.605263
# 3       3 4.789474  6.210526 5.500000
# 4       4 6.684211  8.105263 7.394737
# 5       5 8.578947 10.000000 9.289474
# 
# [[2]]
#   deciles      min      max     mean
# 1       1 20.00000 21.57895 20.78947
# 2       2 22.10526 23.68421 22.89474
# 3       3 24.21053 25.78947 25.00000
# 4       4 26.31579 27.89474 27.10526
# 5       5 28.42105 30.00000 29.21053

If you don't like lapply, you can Vectorize the function to make it a little easier:

Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))

Which you would more commonly use like this (SIMPLIFY = FALSE to retain the list structures)

v <- Vectorize(f, SIMPLIFY = FALSE)
v(c('num1','num1'))

# $num1
#   deciles      min       max     mean
# 1       1 1.000000  2.421053 1.710526
# 2       2 2.894737  4.315789 3.605263
# 3       3 4.789474  6.210526 5.500000
# 4       4 6.684211  8.105263 7.394737
# 5       5 8.578947 10.000000 9.289474
# 
# $num1
#   deciles      min       max     mean
# 1       1 1.000000  2.421053 1.710526
# 2       2 2.894737  4.315789 3.605263
# 3       3 4.789474  6.210526 5.500000
# 4       4 6.684211  8.105263 7.394737
# 5       5 8.578947 10.000000 9.289474

Upvotes: 1

tchakravarty
tchakravarty

Reputation: 10954

I would strictly prefer to use dplyr for this, even though there is some ugliness in handling string variable names in the call to summarize_ (note the trailing _):

library(lazyeval)
library(dplyr)

# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
                 num2=seq(20,30,len=20),
                 char1=c(rep('a',10), rep('b',10)),
                 target=c(rep(1,10), rep(0,10))
)

# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")

# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
  lapply(
    numericCols, function(x) {
      # compute the quantiles
      quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))

      # create quantile membership
      dfX[["quantile_membership"]] =
        findInterval(dfX[[x]], vec = quantiles,
                     rightmost.closed = TRUE,
                     all.inside = TRUE)

      # summarize variables by decile
      dfX %>%
        group_by(quantile_membership)   %>%
        summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
                   max = interp( ~ max(x_name), x_name = as.name(x)),
                   mean = interp( ~ mean(x_name), x_name = as.name(x)))
    }),
  numericCols
)

# inspect the output
liDFY[[numericCols[1]]]

Upvotes: 0

Related Questions