Chargaff
Chargaff

Reputation: 1572

Frequency count, grouped by two coumns in R

How to count frequencies occurring in two columns ? Sample datas :

> sample <- dput(df)
structure(list(Nom_xp = c("A1FAA", "A1FAJ", "A1FBB", "A1FJA", 
"A1FJR", "A1FRJ"), GB05.x = c(100L, 98L, NA, 100L, 102L, 98L), 
    GB05.1.x = c(100L, 106L, NA, 100L, 102L, 98L), GB18.x = c(175L, 
    173L, 177L, 177L, 173L, 177L), GB18.1.x = c(177L, 175L, 177L, 
    177L, 177L, 177L)), .Names = c("Nom_xp", "GB05.x", "GB05.1.x", 
"GB18.x", "GB18.1.x"), row.names = c(NA, 6L), class = "data.frame")

Counting frequencies :

apply(sample[,2:5],2,table)

Now, how to combine the count by prefix of columns, or by every two columns ? The expected output, for the four first columns would be a list:

$GB05
98 100 102 106
3  4   2   1
$GB18
173 175 177
2   2   8

One way to get the count for the first two columns :

  table(c(apply(sample[,2:3],2,rbind)))
98 100 102 106 
  3   4   2   1

But how to apply this to a whole data.frame ?

Upvotes: 1

Views: 467

Answers (3)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

Here is another answer that is sort of a hybrid between Anthony's answer and juba's answer.

The first step is to convert the data.frame into a "long" data.frame. I generally use stack when I can, but you can also do library(reshape2); df2 <- melt(df) to get output similar to my df2 object.

df2 <- data.frame(df[1], stack(df[-1]))
head(df2)
#   Nom_xp values    ind
# 1  A1FAA    100 GB05.x
# 2  A1FAJ     98 GB05.x
# 3  A1FBB     NA GB05.x
# 4  A1FJA    100 GB05.x
# 5  A1FJR    102 GB05.x
# 6  A1FRJ     98 GB05.x

Next, we need to know the unique values of ind. juba did that with substr, but I've done it here with gsub and a regular expression. We don't need to add that into our data.frame; we can call it directly in our other functions. The two functions which immediately come to mind are by and tapply, and both give you the output you are looking for.

by(df2$values, 
   list(ind = gsub("([A-Z0-9]+)\\..*", "\\1", df2$ind)), 
   FUN=table)
# ind: GB05
# 
#  98 100 102 106 
#   3   4   2   1 
# ------------------------------------------------------------------------------ 
# ind: GB18
# 
# 173 175 177 
#   2   2   8 

tapply(df2$values, gsub("([A-Z0-9]+)\\..*", "\\1", df2$ind), FUN = table)
# $GB05
# 
#  98 100 102 106 
#   3   4   2   1 
# 
# $GB18
# 
# 173 175 177 
#   2   2   8 

Upvotes: 1

juba
juba

Reputation: 49073

If you want to apply table to your whole data frame, you can use :

table(unlist(sample[,-1]))

Which gives :

 98 100 102 106 173 175 177 
  3   4   2   1   2   2   8 

If you want to group by column name prefix, for example the 4th first characters, you can do something like this :

cols <- names(sample)[-1]
groups <- unique(substr(cols,0,4))
sapply(groups, function(name) table(unlist(sample[,grepl(paste0("^",name),names(sample))])))

Which gives :

$GB05

 98 100 102 106 
  3   4   2   1 

$GB18

173 175 177 
  2   2   8 

Upvotes: 2

Anthony Damico
Anthony Damico

Reputation: 6124

i would've said juba's answer was correct, but given you're looking for something else, perhaps it's this?

library(reshape2)

x <- melt( sample[ , 2:5 ] )

table( x[ , c( 'variable' , 'value' ) ] ) 

which gives

          value
variable   98 100 102 106 173 175 177
  GB05.x    2   2   1   0   0   0   0
  GB05.1.x  1   2   1   1   0   0   0
  GB18.x    0   0   0   0   2   1   3
  GB18.1.x  0   0   0   0   0   1   5

please provide an example of your desired output structure :)

Upvotes: 1

Related Questions