Reputation: 2929
I would like to write a function using ddply that outputs summary statistics for two columns of the data.frame mat, given the column names.
mat is a large data.frame with columns "metric", "length", "species", "tree", ..., "index".
index is a factor with two levels, "Short" and "Long"; "metric", "length", "species", "tree" and the others are all continuous variables.
Function:
summary1 <- function(arg1,arg2) {
  ...
  ss <- ddply(mat, .(index), function(X) data.frame(
    arg1 = as.list(summary(X$arg1)),
    arg2 = as.list(summary(X$arg2)),
    .parallel = FALSE)
  ss
}
I expect the output to look like this after calling summary1("metric","length")
Short metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.
....
Long metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.
....
At the moment the function does not produce the desired output. What modification should be made here?
Thanks for your help.
Here is a toy example
mat <- data.frame(
metric = rpois(10,10), length = rpois(10,10), species = rpois(10,10),
tree = rpois(10,10), index = c(rep("Short",5),rep("Long",5))
)
Upvotes: 5
Views: 3428
Reputation: 10891
Since ddply is long outdated by now, skimr is a quick way to get grouped summary statistics:
> library(dplyr)
> library(skimr)
> my_skim <- skim_with(numeric = sfl(median))
> mat %>% group_by(index) %>% my_skim()
── Data Summary ────────────────────────
Values
Name Piped data
Number of rows 10
Number of columns 5
_______________________
Column type frequency:
numeric 4
________________________
Group variables index
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
skim_variable index n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist median
1 metric Long 0 1 10.2 3.70 5 8 11 13 14 ▃▃▁▃▇ 11
2 metric Short 0 1 10.6 3.21 6 10 11 11 15 ▂▁▇▁▂ 11
3 length Long 0 1 9.8 2.05 8 8 10 10 13 ▇▇▁▁▃ 10
4 length Short 0 1 8.6 1.34 7 8 8 10 10 ▃▇▁▁▇ 8
5 species Long 0 1 8.8 4.09 4 7 8 10 15 ▃▇▃▁▃ 8
6 species Short 0 1 11.4 3.36 7 9 12 14 15 ▃▃▁▃▇ 12
7 tree Long 0 1 8.8 3.83 6 6 7 10 15 ▇▁▂▁▂ 7
8 tree Short 0 1 9 2.55 6 8 9 9 13 ▃▃▇▁▃ 9
The summary statistics shown, such as median, can be customized with sfl() passed into the skim_with factory.
The resulting summary is in tall form, with one row per variable and level of the grouping variable index. This is usually easier to work with than many summary columns in a wide format. You can also get the summary as a data frame instead of the printed text summary.
Upvotes: 0
Reputation: 50704
As Nick wrote in his answer, you can't use $ to reference a column whose name is passed as a character string. When you write X$arg1, R searches for a column literally named "arg1" in the data.frame X. You can reference the column with either X[,arg1] or X[[arg1]].
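The difference can be checked on a small data frame. This is only a minimal sketch in base R; df and col_name are hypothetical names used for illustration and are not part of the question's code.

```r
# Hypothetical toy data frame; col_name holds a column name as a string.
df <- data.frame(metric = c(1, 2, 3), length = c(4, 5, 6))
col_name <- "metric"

# `$` takes what follows literally, so this looks for a column called
# "col_name". No such column exists, so the result is NULL.
by_dollar <- df$col_name

# `[[` and `[, ]` evaluate col_name first, so both return the "metric" column.
by_double_bracket <- df[[col_name]]
by_single_bracket <- df[, col_name]
```

This is why summary(X$arg1) in the question ends up summarising NULL rather than the intended column.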
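If you want nicely named output, I propose the solution below:
library(plyr)

summary1 <- function(arg1, arg2) {
  # For each level of index, summarise both columns. setNames() labels the
  # two lists with the column names, so data.frame() builds metric.Min. etc.
  ss <- ddply(mat, .(index), function(X) data.frame(
    setNames(
      list(as.list(summary(X[[arg1]])), as.list(summary(X[[arg2]]))),
      c(arg1, arg2)
    )), .parallel = FALSE)
  ss
}
summary1("metric", "length")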
Output for toy data is:
index metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu.
1 Long 5 7 10 8.6 10
2 Short 7 7 9 8.8 10
metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu.
1 11 9 10 11 10.8 12
2 11 4 9 9 9.0 11
length.Max.
1 12
2 12
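To see where the column names come from, the naming step can be tried on its own for a single group. This is a sketch in base R only; the vectors metric_values and length_values are made-up stand-ins for one group's columns, not data from the question.

```r
# Hypothetical values standing in for one group's two columns.
metric_values <- c(5, 7, 10, 10, 11)
length_values <- c(9, 10, 11, 12, 12)

# Label the two summary lists with the column names, then let data.frame()
# flatten them: each statistic becomes "<column>.<statistic>".
one_row <- data.frame(
  setNames(
    list(as.list(summary(metric_values)), as.list(summary(length_values))),
    c("metric", "length")
  )
)

# One row with twelve columns, from metric.Min. through length.Max.
names(one_row)
```

ddply then builds one such row per level of index and binds the rows together.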
Upvotes: 4
Reputation: 11956
Is this more like what you want?
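summary1 <- function(arg1, arg2) {
  # X[, arg1] looks the column up by the string stored in arg1.
  # The output columns are still prefixed "arg1." and "arg2.".
  ss <- ddply(mat, .(index), function(X) {
    data.frame(
      arg1 = as.list(summary(X[, arg1])),
      arg2 = as.list(summary(X[, arg2])))
  }, .parallel = FALSE)  # .parallel is an argument of ddply, not data.frame
  ss
}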
Upvotes: 1