user3354212

Reputation: 1112

Subset lists based on a condition in R

I have a data frame which looks like:

df = read.table(text="S00001    S00002  S00003  S00004  S00005  S00006  
GG  AA  GG  AA  GG  AG  
CC  TT  TT  TC  TC  TT  
TT  CC  CC  TT  TT  TT  
AA  AA  GG  AA  AG  AA  
TT  CC  CC  TT  TC  TT  
GG  GG  GG  AA  GG  GG", header=TRUE, stringsAsFactors=FALSE)

For each row, I would like to count the number of strings made of two identical letters (i.e. "AA", "CC", "GG", or "TT"). I used the table() function to count all elements in each row, and generated a second list flagging which names in each table are "homo". I then tried to subset the first list with the second, but it didn't work. Here is my script:

A <- apply(df,1, function(x) table(x))
B <- apply(df,1, function(x) (names(table(x)) %in% c("AA","CC","GG","TT")))
A[B] ## this didn't work

I expect a data frame like this to be generated:

2 3
1 3
2 4
4 1
2 3
1 5

I would appreciate any help.

Upvotes: 3

Views: 763

Answers (3)

akrun

Reputation: 886938

We could do this with a single apply:

t(apply(df, 1, function(x) {
    tbl <- table(x)
    tbl[names(tbl) %in% c("AA", "CC", "GG", "TT")]
}))
#      [,1] [,2]
#[1,]    2    3
#[2,]    1    3
#[3,]    2    4
#[4,]    4    1
#[5,]    2    3
#[6,]    1    5
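
One caveat, offered as a sketch rather than as part of the original answer: the result above simplifies to a two-column matrix only because every row in this data happens to contain exactly two homozygous genotypes. If that is not guaranteed, tabulating against a fixed set of levels keeps the columns consistent and fills in zeros. The names homo and counts are illustrative only:

```r
df <- read.table(text = "S00001 S00002 S00003 S00004 S00005 S00006
GG AA GG AA GG AG
CC TT TT TC TC TT
TT CC CC TT TT TT
AA AA GG AA AG AA
TT CC CC TT TC TT
GG GG GG AA GG GG", header = TRUE, stringsAsFactors = FALSE)

# Illustrative name: the genotypes we want to count
homo <- c("AA", "CC", "GG", "TT")

# Converting each row to a factor with fixed levels makes table()
# return one count per level (including zeros), so every row yields
# a vector of the same length and apply() can simplify to a matrix.
counts <- t(apply(df, 1, function(x) table(factor(x, levels = homo))))
counts
```

This gives one column per genotype (AA, CC, GG, TT) instead of two unnamed columns, which may or may not be what you want.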

Upvotes: 3

Pierre L

Reputation: 28441

Try mapply. It takes the elements of the two lists pairwise, in order, and evaluates each pair. The header names are auto-generated; you can change them as you see fit:

t(mapply('[', A, B))
     AA GG
[1,]  2  3
[2,]  1  3
[3,]  2  4
[4,]  4  1
[5,]  2  3
[6,]  1  5
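
To make the pairwise idea explicit, here is a rough sketch, assuming A and B are built exactly as in the question. Both are lists with one element per row, and a list cannot be used directly as an index into another list, which is why A[B] fails. The names picked and res are illustrative only:

```r
df <- read.table(text = "S00001 S00002 S00003 S00004 S00005 S00006
GG AA GG AA GG AG
CC TT TT TC TC TT
TT CC CC TT TT TT
AA AA GG AA AG AA
TT CC CC TT TC TT
GG GG GG AA GG GG", header = TRUE, stringsAsFactors = FALSE)

# A: one table per row; B: one logical mask per row (as in the question)
A <- apply(df, 1, function(x) table(x))
B <- apply(df, 1, function(x) names(table(x)) %in% c("AA", "CC", "GG", "TT"))

# Pair the i-th table with the i-th mask and subset one by the other
picked <- Map(function(tbl, keep) tbl[keep], A, B)

# Drop the table attributes and stack the kept counts row by row
res <- do.call(rbind, lapply(picked, as.vector))
res
```

This should give the same counts as the mapply call, just without the auto-generated column names.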

As mentioned by CathG, you can avoid calculating B with:

t(sapply(A, function(x){x[grepl("([A-Z])\\1", names(x))]}))
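
For readers unfamiliar with the pattern, a small illustration of what the regular expression matches (the genotypes vector here is made up for the example):

```r
# Made-up example vector of two-letter genotype codes
genotypes <- c("AA", "AG", "CC", "TC", "GG", "TT")

# "([A-Z])" captures one uppercase letter and "\\1" requires that same
# letter to follow immediately, i.e. a doubled letter.
is_homo <- grepl("([A-Z])\\1", genotypes)

genotypes[is_homo]
```

Note that the pattern is not anchored, so it would also match longer strings that contain a doubled letter anywhere. For two-letter codes like these that makes no difference.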

Upvotes: 4

David Arenburg

Reputation: 92282

I don't like apply because it converts the data frame to a matrix, and I especially dislike apply(df, 1, ...) because it operates row by row.

Alternatively, I would suggest a helper function that combines sapply with rowSums (which operates on the matrix that sapply returns):

f <- function(x, y) rowSums(sapply(x, `%in%`, y))

Then you could do (without calculating A and B):

cbind(f(df, c("AA", "CC")), 
      f(df, c("GG", "TT")))
#      [,1] [,2]
# [1,]    2    3
# [2,]    1    3
# [3,]    2    4
# [4,]    4    1
# [5,]    2    3
# [6,]    1    5

Or just (depending on what you are looking for):

f(df, c("AA", "CC", "GG", "TT"))
# [1] 5 4 6 5 5 6
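
To see what the helper does internally, here is a sketch of the intermediate step for one of the calls (hits is an illustrative name):

```r
df <- read.table(text = "S00001 S00002 S00003 S00004 S00005 S00006
GG AA GG AA GG AG
CC TT TT TC TC TT
TT CC CC TT TT TT
AA AA GG AA AG AA
TT CC CC TT TC TT
GG GG GG AA GG GG", header = TRUE, stringsAsFactors = FALSE)

# sapply() walks over the columns of df; each column is tested with %in%,
# giving a logical vector with one entry per row. The results are bound
# into a 6 x 6 logical matrix whose rows correspond to the rows of df.
hits <- sapply(df, `%in%`, c("AA", "CC"))
dim(hits)

# Summing TRUEs across each row counts the matches in that row
rowSums(hits)
```

Keep in mind that grouping "AA" with "CC" (and "GG" with "TT") reproduces the expected output only because, in this data, no row contains both genotypes of a pair. With other data you would probably want one call per genotype.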

Upvotes: 3
