R: Count occurrences of value in multiple columns

Question

I am very new to R, and am using it for a statistical analysis of a very large dataset (10 million cases).

In this dataset, there are many variables/columns (Diagnosis1, Diagnosis2...Diagnosis30) that contain categorical codes. Cases are listed as rows. There are 30 columns and about 200 unique categorical codes in the actual dataset. Codes will not appear multiple times within the same case, column number does not imply any importance.

Something like this

Diagnosis1         Diagnosis2         Diagnosis3
001                123                234
456                001                678
123                998                999

What I need is an overall count like

001 2 (x%)
123 2 (x%)
234 1 (y%)
456 1 (y%)
678 1 (y%)
998 1 (y%)
999 1 (y%)

To get the frequencies of the categorical codes appear in the total population.

Rui Barradas · Accepted Answer

A complete answer following the lines of my comment would be something like this.

dat <- read.table(text = "
Diagnosis1         Diagnosis2         Diagnosis3
001                123                234
456                001                678
123                998                999
", header = TRUE)

count <- table(unlist(dat))
perc <- 100*count/sum(count)

result <- data.frame(code = sprintf("%03d", as.integer(names(count))),
                     count = as.integer(count), perc = as.numeric(perc))

result
#  code count     perc
#1  001     2 22.22222
#2  123     2 22.22222
#3  234     1 11.11111
#4  456     1 11.11111
#5  678     1 11.11111
#6  998     1 11.11111
#7  999     1 11.11111

R: Count occurrences of value in multiple columns

Answers (2)

Related Questions