RROBINSON
RROBINSON

Reputation: 191

R: Count occurrences of value in multiple columns

I am very new to R, and am using it for a statistical analysis of a very large dataset (10 million cases).

In this dataset, there are many variables/columns (Diagnosis1, Diagnosis2...Diagnosis30) that contain categorical codes. Cases are listed as rows. There are 30 columns and about 200 unique categorical codes in the actual dataset. Codes will not appear multiple times within the same case, column number does not imply any importance.

Something like this

Diagnosis1         Diagnosis2         Diagnosis3
001                123                234
456                001                678
123                998                999

What I need is an overall count like

001 2 (x%)
123 2 (x%)
234 1 (y%)
456 1 (y%)
678 1 (y%)
998 1 (y%)
999 1 (y%)

To get the frequencies of the categorical codes appear in the total population.

Upvotes: 1

Views: 15405

Answers (2)

acylam
acylam

Reputation: 18661

One can also use prop.table to calculate the percentages:

DF = data.frame(table(unlist(df)))

DF$prop = prop.table(DF$Freq)

Result:

  Var1 Freq   prop
1    1    2 0.2222222
2  123    2 0.2222222
3  234    1 0.1111111
4  456    1 0.1111111
5  678    1 0.1111111
6  998    1 0.1111111
7  999    1 0.1111111

or with dplyr + tidyr:

library(dplyr)
library(tidyr)

df %>%
  gather(variable, value) %>%
  group_by(value) %>%
  summarize(count = n()) %>%
  mutate(prop = prop.table(count),
         perc = sub("0\\.(\\d{2})\\d+", "\\1%", prop))

Result:

# A tibble: 7 x 4
  value count      prop  perc
  <int> <int>     <dbl> <chr>
1     1     2 0.2222222   22%
2   123     2 0.2222222   22%
3   234     1 0.1111111   11%
4   456     1 0.1111111   11%
5   678     1 0.1111111   11%
6   998     1 0.1111111   11%
7   999     1 0.1111111   11%

Data:

df = read.table(text = "Diagnosis1         Diagnosis2         Diagnosis3
                001                123                234
                456                001                678
                123                998                999", header = TRUE)

Upvotes: 5

Rui Barradas
Rui Barradas

Reputation: 76402

A complete answer following the lines of my comment would be something like this.

dat <- read.table(text = "
Diagnosis1         Diagnosis2         Diagnosis3
001                123                234
456                001                678
123                998                999
", header = TRUE)

count <- table(unlist(dat))
perc <- 100*count/sum(count)

result <- data.frame(code = sprintf("%03d", as.integer(names(count))),
                     count = as.integer(count), perc = as.numeric(perc))

result
#  code count     perc
#1  001     2 22.22222
#2  123     2 22.22222
#3  234     1 11.11111
#4  456     1 11.11111
#5  678     1 11.11111
#6  998     1 11.11111
#7  999     1 11.11111

Upvotes: 5

Related Questions