Reputation: 63

R- DataTable count the frequency of categorical variables and display each variable as column

I created a dummy data table called DT. And I am trying to calculate the sum of Capacity(numerical), Count the frequency of Code and State( Categorical) within each ID. For the end result, I want to display the sum of Capacity, frequency of A,B,C... and different State within each unique ID. Therefore, the column name will be ID,total.Cap,A,B,C... AZ,CA..

DT <- data.table(ID = rep(1:500,100),
            Capacity = sample(1:1000, size = 50000, replace =T),
            Code = sample(LETTERS[1:26], 50000, replace = T),
            State = rep(c("AZ","CA","PA","NY","WA","SD"), 50000))


The format of result will like the table below: 
ID total.Cap  A   B   C  ...   AZ  CA ...
1   28123    10   25  70 ...   29  ...
2   32182    20   42  50  ...  30  ...
3

I have tried to to use ddply, melt and dcast.. But the result does not comes out as what I thought. Could anyone give me some hints about how to structure a table looks like this? Thank you!

Upvotes: 2

Answers (2)

David Robinson

Reputation: 78640

You can do this by constructing the totals, state counts, and code counts with three separate data.table statements then joining them. On states and codes, you can use dcast to turn it into one column per state/code with the counts within each.

library(data.table)

totals <- DT[, list(total.Cap = sum(Capacity)), by = "ID"]
states <- dcast(DT, ID ~ State)
codes <- dcast(DT, ID ~ Code)

You can then join the three tables together:

result <- setkey(totals, "ID")[states, ][codes, ]

This results in a table something like:

      ID total.Cap  AZ  CA  NY  PA  SD  WA  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U
  1:   1    287526 200   0   0 200   0 200 12 18 24 42 12 30 30 18  6 36 24  6 18 24 30 24  6 24 36 18 30
  2:   2    293838   0 200 200   0 200   0 18 24 42 30 30 12 24  6 24 12 48 42 18 18 42 24 24 24 12 18 24
  3:   3    279450 200   0   0 200   0 200 24 18 24  6 12 12 18 12 12 30 24 18 54 30  6 42 18 30 24 24 18
  4:   4    298200   0 200 200   0 200   0 30 30 36 30 36 24 24 18 24 18 30 30 30 24  6 30 18  6 18 18 18
  5:   5    294084 200   0   0 200   0 200 18  6 24 12 42 12 18 42 18 18 18 18 24 24 30 18 30 24  6 30 24

Note that if you have many columns like State and Code, you can do all of them at once by melting them first:

# replace State and Code with the categorical variables you want
melted <- melt(DT, measure.vars = c("State", "Code"))
state_codes <- dcast(melted, ID ~ value)
setkey(totals, "ID")[state_codes, ]

Note you still need to join with the totals, and that this will not preserve the order of columns like "states then codes" or vice versa.

Upvotes: 1

treysp

Reputation: 713

This creates the total.Cap, Code, and State summary elements in three separate data tables then merges them by ID:

# Storing intermediate pieces
  total_cap <- DT[, j = list(total.Cap = sum(Capacity)), by = ID]
  code <- dcast(DT[, .N, by = c("ID", "Code")], ID ~ Code, fill = 0)
  state <- dcast(DT[, .N, by = c("ID", "State")], ID ~ State, fill = 0)

  mytable <- merge(total_cap, code, by = "ID")
  mytable <- merge(mytable, state, by = "ID")
  mytable

# As a one-liner
  mytable <- merge(
               merge(DT[, j = list(total.Cap = sum(Capacity)), by = ID],
                     dcast(DT[, .N, by = c("ID", "Code")], ID ~ Code, fill = 0),
                     by = "ID"),
               dcast(DT[, .N, by = c("ID", "State")], ID ~ State, fill = 0),
               by = "ID")
  mytable

Upvotes: 0

R- DataTable count the frequency of categorical variables and display each variable as column

Answers (2)

Related Questions