Prasanna Nandakumar

Reputation: 4335

Summary statistics using apply family for different factor levels

I am trying to find the summary statistics for different factor levels.

data.frame(apply(final_data[Company=="BPO",c(66:84)],2,summary))  

Company takes several different values. I could repeat the statement for each one, but I know this can be automated with the apply family (tapply, sapply) or plyr's ddply; I just can't get it right.

Upvotes: 1

Views: 2563

Answers (2)

Thomas

Reputation: 44555

You may want to think about using the by or tapply functions. This will allow you to skip the explicit call to split. Here's an example, since you haven't provided data.

# some example data
set.seed(1)
df <- data.frame(x = as.factor(rep(1:5, each=10)), y1=rnorm(50), y2=rnorm(50))

# with `tapply`
a <- do.call(rbind, sapply(df[,2:3], function(i) tapply(i, df$x, summary)))
# with `by`
a <- do.call(rbind, sapply(df[,2:3], function(i) by(i, df$x, summary)))

Here's the output:

> a
         Min.  1st Qu.    Median    Mean 3rd Qu.   Max.
 [1,] -0.8356 -0.54620  0.256600  0.1322  0.5537 1.5950
 [2,] -2.2150 -0.03775  0.491900  0.2488  0.9132 1.5120
 [3,] -1.9890 -0.39760  0.009218 -0.1337  0.5694 0.9190
 [4,] -1.3770 -0.32140 -0.056560  0.1207  0.6693 1.3590
 [5,] -0.7075 -0.23120  0.126100  0.1341  0.6619 0.8811
 [6,] -1.1290 -0.55080  0.103000  0.1435  0.5268 1.9800
 [7,] -1.8050 -0.02243  0.171000  0.4512  1.2720 2.4020
 [8,] -1.2540 -0.67980 -0.221100 -0.2477  0.2372 0.6107
 [9,] -1.5240 -0.26190  0.300000  0.1274  0.5380 1.1780
[10,] -1.2770 -0.56560  0.042540  0.1123  1.0450 1.5870

You might also want to combine this with the variable and level names to know what's going on:

b <- expand.grid(level=levels(df$x),var=names(df[,2:3]))
cbind(b,a)

Here's the output of that:

> cbind(b,a)
   level var    Min.  1st Qu.    Median    Mean 3rd Qu.   Max.
1      1  y1 -0.8356 -0.54620  0.256600  0.1322  0.5537 1.5950
2      2  y1 -2.2150 -0.03775  0.491900  0.2488  0.9132 1.5120
3      3  y1 -1.9890 -0.39760  0.009218 -0.1337  0.5694 0.9190
4      4  y1 -1.3770 -0.32140 -0.056560  0.1207  0.6693 1.3590
5      5  y1 -0.7075 -0.23120  0.126100  0.1341  0.6619 0.8811
6      1  y2 -1.1290 -0.55080  0.103000  0.1435  0.5268 1.9800
7      2  y2 -1.8050 -0.02243  0.171000  0.4512  1.2720 2.4020
8      3  y2 -1.2540 -0.67980 -0.221100 -0.2477  0.2372 0.6107
9      4  y2 -1.5240 -0.26190  0.300000  0.1274  0.5380 1.1780
10     5  y2 -1.2770 -0.56560  0.042540  0.1123  1.0450 1.5870
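
For data shaped like the question's, the same pattern might look like the sketch below. The `final_data` built here is a made-up stand-in, and the column names `score1` and `score2` are hypothetical placeholders for columns 66:84; it also assumes the numeric columns contain no missing values.

```r
# Hypothetical stand-in for final_data: one grouping factor, two numeric columns
set.seed(1)
final_data <- data.frame(
  Company = factor(rep(c("BPO", "IT", "Retail"), each = 10)),
  score1  = rnorm(30),
  score2  = rnorm(30)
)
vars <- c("score1", "score2")  # plays the role of columns 66:84

# One summary per (variable, company) pair, stacked into a matrix
a <- do.call(rbind, sapply(final_data[, vars],
                           function(i) tapply(i, final_data$Company, summary)))

# Label each row: company level varies fastest, then variable
b <- expand.grid(level = levels(final_data$Company), var = vars)
res <- cbind(b, a)

# Rows for a single company
res[res$level == "BPO", ]
```

The result is a single data frame with one row per company and variable, so it can be filtered or sorted like any other table.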

Upvotes: 2

josliber

Reputation: 44330

You could split on company and then use your function:

spl = split(final_data, final_data$Company)
list.of.summaries = lapply(spl, function(x) data.frame(apply(x[,66:84], 2, summary)))
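
Since the question has no sample data, here is a minimal self-contained sketch of the same idea. The toy `final_data` and the column names `score1` and `score2` are invented for illustration and stand in for columns 66:84; the sketch assumes those columns are numeric with no missing values.

```r
# Toy stand-in for final_data (hypothetical): a grouping column plus two numeric columns
set.seed(1)
final_data <- data.frame(
  Company = rep(c("BPO", "IT", "Retail"), each = 10),
  score1  = rnorm(30),
  score2  = rnorm(30)
)
num_cols <- c("score1", "score2")  # plays the role of columns 66:84

# Split the rows by company, then summarise the numeric columns of each piece
spl <- split(final_data, final_data$Company)
list.of.summaries <- lapply(spl, function(x) data.frame(apply(x[, num_cols], 2, summary)))

names(list.of.summaries)     # one list element per company
list.of.summaries[["BPO"]]   # same table the single-company call would give
```

Each list element is named after its company, so an individual summary can be pulled out by name rather than by repeating the original statement.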

Upvotes: 3
