Reputation: 2929

Summary statistics using ddply

I would like to write a function using ddply that outputs summary statistics for two columns of the data.frame mat, given the column names.

mat is a large data.frame with columns named "metric", "length", "species", "tree", ..., "index".
index is a factor with 2 levels, "Short" and "Long".
"metric", "length", "species", "tree" and the others are all continuous variables.

Function:

summary1 <- function(arg1,arg2) {
    ...

    ss <- ddply(mat, .(index), function(X) data.frame(
        arg1 = as.list(summary(X$arg1)),
        arg2 = as.list(summary(X$arg2))),
        .parallel = FALSE)

    ss
}

I expect the output to look like this after calling summary1("metric","length")

Short metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.

....

Long metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.

....

At the moment the function does not produce the desired output. What modification should be made?

Thanks for your help.

Here is a toy example

mat <- data.frame(
    metric = rpois(10,10), length = rpois(10,10), species = rpois(10,10),
    tree = rpois(10,10), index = c(rep("Short",5),rep("Long",5))
)

Upvotes: 5

Answers (3)

qwr

Reputation: 10891

As ddply is long outdated now, skimr is a quick way to get grouped summary statistics:

> library(dplyr)
> library(skimr)
> my_skim <- skim_with(numeric = sfl(median))
> mat %>% group_by(index) %>% my_skim
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             10        
Number of columns          5         
_______________________              
Column type frequency:               
  numeric                  4         
________________________             
Group variables            index     

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
  skim_variable index n_missing complete_rate mean   sd p0 p25 p50 p75 p100 hist  median
1 metric        Long          0             1 10.2 3.70  5   8  11  13   14 ▃▃▁▃▇     11
2 metric        Short         0             1 10.6 3.21  6  10  11  11   15 ▂▁▇▁▂     11
3 length        Long          0             1  9.8 2.05  8   8  10  10   13 ▇▇▁▁▃     10
4 length        Short         0             1  8.6 1.34  7   8   8  10   10 ▃▇▁▁▇      8
5 species       Long          0             1  8.8 4.09  4   7   8  10   15 ▃▇▃▁▃      8
6 species       Short         0             1 11.4 3.36  7   9  12  14   15 ▃▃▁▃▇     12
7 tree          Long          0             1  8.8 3.83  6   6   7  10   15 ▇▁▂▁▂      7
8 tree          Short         0             1  9   2.55  6   8   9   9   13 ▃▃▇▁▃      9

The summary statistics shown, like median, can be customized with sfl() passed into the skim_with factory.

The resulting summary is in tall form based on grouping variable index. This is better to work with than many summary columns in a wide format. You can also get the summary dataframe instead of the printed text summary.

Upvotes: 0

Marek

Reputation: 50704

As Nick wrote in his answer, you can't use $ to reference a column whose name is passed as a character string. When you write X$arg1, R searches for a column literally named "arg1" in the data.frame X. You can reference the column with either X[,arg1] or X[[arg1]].
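
The difference can be seen on a small stand-alone example. This is only a sketch: the data frame df and the variable col are made up for illustration and are not part of the question's data.

```r
# Hypothetical two-column data frame, for illustration only
df <- data.frame(metric = c(1, 2, 3), length = c(4, 5, 6))
col <- "metric"

by_dollar <- df$col     # looks for a column literally named "col": NULL
by_double <- df[[col]]  # evaluates col first, then extracts column "metric"
by_single <- df[, col]  # same column, using matrix-style indexing

is.null(by_dollar)
identical(by_double, by_single)
```

Both bracket forms evaluate col before looking up the column, which is why they work inside a function whose arguments hold column names.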

And if you want nicely named output, I propose the solution below:

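library(plyr)

summary1 <- function(arg1, arg2) {
    # X[[arg1]] looks up the column whose name is stored in arg1
    ss <- ddply(mat, .(index), function(X) data.frame(
        setNames(
            list(as.list(summary(X[[arg1]])), as.list(summary(X[[arg2]]))),
            c(arg1, arg2)  # the outer names become the column-name prefixes
            )), .parallel = FALSE)

    ss
}
summary1("metric", "length")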

Output for toy data is:

  index metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu.
1  Long           5              7            10         8.6             10
2 Short           7              7             9         8.8             10
  metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu.
1          11           9             10            11        10.8             12
2          11           4              9             9         9.0             11
  length.Max.
1          12
2          12
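
The column names come from how data.frame() flattens a named list of lists: the outer name is used as a prefix for each inner name. A minimal sketch of just that step, without ddply, assuming a made-up vector x:

```r
# Hypothetical values, for illustration only
x <- c(1, 2, 3)

# summary() returns six named statistics; as.list() keeps the names
stats <- as.list(summary(x))

# Wrap the statistics in an outer list named "metric", then flatten it
wide <- data.frame(setNames(list(stats), "metric"))

names(wide)  # prefixed names such as "metric.Min."
```

Swapping "metric" for the value of arg1 gives the prefixes seen in the output above.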

Upvotes: 4

Nick Sabbe

Reputation: 11956

Is this more like what you want?

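library(plyr)

summary1 <- function(arg1, arg2) {
    # X[, arg1] selects the column whose name is stored in arg1
    ss <- ddply(mat, .(index), function(X) {
        data.frame(
            arg1 = as.list(summary(X[, arg1])),
            arg2 = as.list(summary(X[, arg2])))
    }, .parallel = FALSE)  # .parallel is an argument of ddply, not of data.frame
    ss
}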

Upvotes: 1
