Reputation: 3781
I need to loop over a data frame and calculate functions over the variable that is being looped.
A table example:
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
I create a list of variables:
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
And the table that I will populate:
planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
planF<-planF[-1,]
And the loop:
library(plyr)
for (i in 1:length(nums)){
table$deciles<-ntile(table[,nums[i]],5)
plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
max=max(nums[i]),pos=sum(target))
planF<-rbind(planF,plan)
}
I need to get the min and max of the variable por each decile. But instead I get:
deciles min max pos
1 1 num1 num1 4
2 2 num2 num2 4
3 3 <NA> <NA> 2
4 4 <NA> <NA> 0
5 5 <NA> <NA> 0
6 1 num1 num1 4
7 2 num2 num2 4
8 3 <NA> <NA> 2
9 4 <NA> <NA> 0
10 5 <NA> <NA> 0
For variable num1 I need to get the result of:
ddply(table, 'deciles', summarize, min=min(num1),
max=max(num1),pos=sum(target))
deciles min max pos
1 5.736842 7.157895 0
2 7.631579 9.052632 0
3 1.000000 10.000000 2
4 1.947368 3.368421 4
5 3.842105 5.263158 4
And below the result of doing the same with num2.
I understand that I need to introduce the variable with the following form:
num1
but the code is writing
'num1'
I tried with:
min=min(as.name(nums[i]))
But I get an error:
Error in min(as.name(nums[i])) : 'type' (symbol) not valid argument
how can I calculate a function over the variable that is being looped?
Upvotes: 0
Views: 708
Reputation: 20811
The gist of your question is to apply a list of functions over the split-apply-combine method, so here is one way you can do this in base r.
## your data
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
table$deciles <- ntile(table[, nums[1]], 5)
FUNS <- list(min = min, max = max, mean = mean)
## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))
## combine results
do.call('rbind', x)
# min max mean
# 1 1.000000 2.421053 1.710526
# 2 2.894737 4.315789 3.605263
# 3 4.789474 6.210526 5.500000
# 4 6.684211 8.105263 7.394737
# 5 8.578947 10.000000 9.289474
Instead of using a loop, since we have the above which works and is fairly simple, put it into a function like below
f <- function(num, data = table) {
FUNS <- list(min = min, max = max, mean = mean)
x <- tapply(data[, num], data[, 'deciles'], function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))
cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}
This way we have the method generalized so it can use any column you have with any data you have. You can call it for individual columns like
f('num1')
f('num2')
Or use a loop to get everything at once
lapply(c('num1','num2'), f)
# [[1]]
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# [[2]]
# deciles min max mean
# 1 1 20.00000 21.57895 20.78947
# 2 2 22.10526 23.68421 22.89474
# 3 3 24.21053 25.78947 25.00000
# 4 4 26.31579 27.89474 27.10526
# 5 5 28.42105 30.00000 29.21053
If you don't like lapply
, you can Vectorize
the function to make it a little easier:
Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))
Which you would more commonly use like this (SIMPLIFY = FALSE
to retain the list structures)
v <- Vectorize(f, SIMPLIFY = FALSE)
v(c('num1','num1'))
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
Upvotes: 1
Reputation: 10954
I would strictly prefer to use dplyr
for this, even though there is some ugliness in handling string variable names in the call to summarize_
(note the trailing _
):
library(lazyeval)
library(dplyr)
# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10))
)
# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")
# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
lapply(
numericCols, function(x) {
# compute the quantiles
quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))
# create quantile membership
dfX[["quantile_membership"]] =
findInterval(dfX[[x]], vec = quantiles,
rightmost.closed = TRUE,
all.inside = TRUE)
# summarize variables by decile
dfX %>%
group_by(quantile_membership) %>%
summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
max = interp( ~ max(x_name), x_name = as.name(x)),
mean = interp( ~ mean(x_name), x_name = as.name(x)))
}),
numericCols
)
# inspect the output
liDFY[[numericCols[1]]]
Upvotes: 0