Reputation: 191
I am very new to R, and am using it for a statistical analysis of a very large dataset (10 million cases).
In this dataset, there are many variables/columns (Diagnosis1, Diagnosis2...Diagnosis30) that contain categorical codes. Cases are listed as rows. There are 30 columns and about 200 unique categorical codes in the actual dataset. Codes will not appear multiple times within the same case, column number does not imply any importance.
Something like this
Diagnosis1 Diagnosis2 Diagnosis3
001 123 234
456 001 678
123 998 999
What I need is an overall count like
001 2 (x%)
123 2 (x%)
234 1 (y%)
456 1 (y%)
678 1 (y%)
998 1 (y%)
999 1 (y%)
To get the frequencies of the categorical codes appear in the total population.
Upvotes: 1
Views: 15405
Reputation: 18661
One can also use prop.table
to calculate the percentages:
DF = data.frame(table(unlist(df)))
DF$prop = prop.table(DF$Freq)
Result:
Var1 Freq prop
1 1 2 0.2222222
2 123 2 0.2222222
3 234 1 0.1111111
4 456 1 0.1111111
5 678 1 0.1111111
6 998 1 0.1111111
7 999 1 0.1111111
or with dplyr
+ tidyr
:
library(dplyr)
library(tidyr)
df %>%
gather(variable, value) %>%
group_by(value) %>%
summarize(count = n()) %>%
mutate(prop = prop.table(count),
perc = sub("0\\.(\\d{2})\\d+", "\\1%", prop))
Result:
# A tibble: 7 x 4
value count prop perc
<int> <int> <dbl> <chr>
1 1 2 0.2222222 22%
2 123 2 0.2222222 22%
3 234 1 0.1111111 11%
4 456 1 0.1111111 11%
5 678 1 0.1111111 11%
6 998 1 0.1111111 11%
7 999 1 0.1111111 11%
Data:
df = read.table(text = "Diagnosis1 Diagnosis2 Diagnosis3
001 123 234
456 001 678
123 998 999", header = TRUE)
Upvotes: 5
Reputation: 76402
A complete answer following the lines of my comment would be something like this.
dat <- read.table(text = "
Diagnosis1 Diagnosis2 Diagnosis3
001 123 234
456 001 678
123 998 999
", header = TRUE)
count <- table(unlist(dat))
perc <- 100*count/sum(count)
result <- data.frame(code = sprintf("%03d", as.integer(names(count))),
count = as.integer(count), perc = as.numeric(perc))
result
# code count perc
#1 001 2 22.22222
#2 123 2 22.22222
#3 234 1 11.11111
#4 456 1 11.11111
#5 678 1 11.11111
#6 998 1 11.11111
#7 999 1 11.11111
Upvotes: 5