Hillash
Hillash

Reputation: 85

How to produce summary stats across multiple columns in R?

I have a [1,758 x 38] data frame where each row is a job posting while the columns are skills required for each posting (skill1 to skill38). Most job postings have a number of identical skills, except that they are listed in different columns. I would like to produce summary stats for the skills required (e.g, most common skill required). I can produce this for a single column using data.table:

data[, .N, keyby = skills1] 

But I am unable to implement a looping mechanism to go through each column. How do I do this?

Upvotes: 2

Views: 679

Answers (2)

agenis
agenis

Reputation: 8377

I use apply inside a wrapper function called sumstats that produces the main statistical indicators:

CV = function(x, ...) {sd(x, ...)/mean(x, ...)}
sumstats=function(x, ...) {
  mean.k=function(x) {if (is.numeric(x)) round(mean(x, ...), digits = 2)
                      else "N*N"}
  median.k=function(x) {  if (is.numeric(x)) round(median(x, ...), digits = 2)
                          else "N*N"}
  sd.k=function(x) {  if (is.numeric(x)) round(sd(x, ...), digits = 2)
                      else "N*N"}
  cv.k=function(x) {  if (is.numeric(x)) round(CV(x, ...), digits = 2)
                      else "N*N"}
  min.k=function(x) {  if (is.numeric(x)) round(min(x, ...), digits = 2)
                       else "N*N"}
  max.k=function(x) {  if (is.numeric(x)) round(max(x, ...), digits = 2)
                       else "N*N"}
  sumtable <- cbind(as.matrix(colSums(!is.na(x))), sapply(x,mean.k), sapply(x,median.k), sapply(x,sd.k),  sapply(x,cv.k), sapply(x,min.k), sapply(x,max.k))
  sumtable <- as.data.frame(sumtable);  names(sumtable) <- c("N.obs","Moy","Med","sd","CV", "min","max")
  return(sumtable)
}

The [...] allows you to add a na.rm=T argument

> head(sumstats(mtcars), 3)
###      N.obs    Moy    Med     sd   CV   min    max
### mpg     32  20.09  19.20   6.03 0.30 10.40  33.90
### cyl     32   6.19   6.00   1.79 0.29  4.00   8.00
### disp    32 230.72 196.30 123.94 0.54 71.10 472.00

Note: it doesn't work if you have only one column!

Upvotes: 0

akrun
akrun

Reputation: 886928

You could do this in base R by using lapply to loop over the columns. The output will be a 'list'.

lapply(data, table)

Or @thelatemail mentioned, the 'wide' format can be converted to 'long' with 2 columns and then do the table

library(reshape2)
table(melt(as.matrix(data))[-1])

A similar method using data.table would be

library(data.table)
setDT(melt(as.matrix(data))[-1])[, .N, .(Var2, value)]

Or using mtabulate

library(qdapTools)
mtabulate(data)

Upvotes: 3

Related Questions