M. Beausoleil
M. Beausoleil

Reputation: 3555

Select rows that contains more than X non-Zero consecutive values

I have a matrix of elements containing years, traits and species. But I want to select only certain species that are present a certain consecutive number of year. I can see this by looking at a table:

mat = matrix(c(2000,2001,2001,2002,2002,2003, 2004, 2005,
               2001,
               2000, 2001, 2002,  2005,
               2000, 2002, 2004, 2004,  2006,
               2,3,3,4,1,2,2,2,
               2,
               1,4,4,3,
               1,4,4,3,2,
               "sp1","sp1","sp1","sp1","sp1","sp1","sp1","sp1",
               "sp2",
               "sp3","sp3","sp3","sp3",
               "sp4","sp4","sp4","sp4","sp4"), nrow = 18)
mat = as.data.frame(mat)
colnames(mat) = c("yr","trait","sp")
res = table(mat$sp,mat$yr)

Here the table looks like this:

      2000 2001 2002 2003 2004 2005 2006
  sp1    1    2    2    1    1    1    0
  sp2    0    1    0    0    0    0    0
  sp3    1    1    1    0    0    1    0
  sp4    1    0    1    0    2    0    1
  1. But here, I want to remove sp2 from my analysis since it was seen only once in 2001 and no other years. Is there a way to do this? I've tried this, but it prints the exact same table:

    res[apply(res,1,function(z) any(z==0)),]
    

    In the end I'd like to delete sp2 from the ´mat´ data, but using the information in the table to delete sp2.

          2000 2001 2002 2003 2004 2005 2006
      sp1    1    2    2    1    1    1    0
      sp3    1    1    1    0    0    1    0
      sp4    1    0    1    0    2    0    1
    

    And the ´mat´ would look like:

         yr trait  sp
    1  2000     2 sp1
    2  2001     3 sp1
    3  2001     3 sp1
    4  2002     4 sp1
    5  2002     1 sp1
    6  2003     2 sp1
    7  2004     2 sp1
    8  2005     2 sp1
    10 2000     1 sp3
    11 2001     4 sp3
    12 2002     4 sp3
    13 2005     3 sp3
    14 2000     1 sp4
    15 2002     4 sp4
    16 2004     4 sp4
    17 2004     3 sp4
    18 2006     2 sp4
    
  2. Also, I want a second command that would allow me to select the individuals from ´mat´ that are seen consecutively in 2 or more years (this would delete sp4, since it was seen at even years only).

    Again, I've tried this, but it's not deleting the right information:

    mat[which(res != 0),]
    

    The end result would be:

          2000 2001 2002 2003 2004 2005 2006
      sp1    1    2    2    1    1    1    0
      sp3    1    1    1    0    0    1    0
    

    And the ´mat´ would look like:

         yr trait  sp
    1  2000     2 sp1
    2  2001     3 sp1
    3  2001     3 sp1
    4  2002     4 sp1
    5  2002     1 sp1
    6  2003     2 sp1
    7  2004     2 sp1
    8  2005     2 sp1
    10 2000     1 sp3
    11 2001     4 sp3
    12 2002     4 sp3
    13 2005     3 sp3
    

This would be applied to a much larger dataset. This is just a small example.

Upvotes: 2

Views: 1893

Answers (3)

Lamia
Lamia

Reputation: 3875

I don't think you need the res table to perform the filtering on your dataframe mat, you can do it directly using dplyr. In order to filter out a given sp that appears only one year, you can do:

library(dplyr)
mat %>% group_by(yr) %>% group_by(sp) %>% filter(n_distinct(yr)>1) %>% ungroup()

  yr  trait     sp
   <fctr> <fctr> <fctr>
1    2000      2    sp1
2    2001      3    sp1
3    2001      3    sp1
4    2002      4    sp1
5    2002      1    sp1
6    2003      2    sp1
7    2004      2    sp1
8    2005      2    sp1
9    2000      1    sp3
10   2001      4    sp3
11   2002      4    sp3
12   2005      3    sp3
13   2000      1    sp4
14   2002      4    sp4
15   2004      4    sp4
16   2004      3    sp4
17   2006      2    sp4

In order to filter out the sps that don't appear two consecutive years, you can do:

mat %>% group_by(sp)%>% filter(min(diff(sort(unique(yr))))==1)

This returns

yr  trait    sp
   <dbl> <fctr> <chr>
1      1      2   sp1
2      2      3   sp1
3      2      3   sp1
4      3      4   sp1
5      3      1   sp1
6      4      2   sp1
7      5      2   sp1
8      6      2   sp1
9      1      1   sp3
10     2      4   sp3
11     3      4   sp3
12     6      3   sp3

Note that this last operation returns a warning as sp2 only has one year. You can combine the two above operations:

mat %>% group_by(yr) %>% group_by(sp) %>% filter(n_distinct(yr)>1) %>% ungroup() %>% group_by(sp)%>% filter(min(diff(sort(unique(yr))))==1)

Which doesn't return the warning.

EDIT: In case you want to filter according to a specified number of consecutive years (not just 2), you could do:

## This function returns the max number of consecutive 1s +1 in a vector, and 0 if there are none or there is just one value in the vector
consec1=function(x){ifelse((1 %in% x),max(rle(x)$lengths[rle(x)$values==1])+1,0)}
## Then use it in your dplyr::filter
mat %>% group_by(sp) %>% filter(consec1(diff(sort(unique(yr))))==6)

Which returns:

 yr  trait     sp
  <dbl> <fctr> <fctr>
1  2000      2    sp1
2  2001      3    sp1
3  2001      3    sp1
4  2002      4    sp1
5  2002      1    sp1
6  2003      2    sp1
7  2004      2    sp1
8  2005      2    sp1

Upvotes: 2

amonk
amonk

Reputation: 1795

  1. I am a bit reluctant but:

     find_zeros<-function(vec){
        bool<-grepl("1{3}",paste(ifelse(vec==0,1,0),collapse = ""),perl = T)  
        return(bool)
     }
    

res[!apply(res,1,find_zeros),]

yielding in the:

       2000 2001 2002 2003 2004 2005 2006
   sp1    1    2    2    1    1    1    0
   sp3    1    1    1    0    0    1    0
   sp4    1    0    1    0    2    0    1

Of course in order to get the trimmed mat one should use:

    mat_trimmed<-mat[(mat$sp %in% row.names(final)),]

Upvotes: 0

Michael Kirchner
Michael Kirchner

Reputation: 889

Replicating your data:

mat = matrix(c(2000,2001,2001,2002,2002,2003, 2004, 2005,
               2001,
               2000, 2001, 2002,  2005,
               2000, 2002, 2004, 2004,  2006,
               2,3,3,4,1,2,2,2,
               2,
               1,4,4,3,
               1,4,4,3,2,
               "sp1","sp1","sp1","sp1","sp1","sp1","sp1","sp1",
               "sp2",
               "sp3","sp3","sp3","sp3",
               "sp4","sp4","sp4","sp4","sp4"), nrow = 18)
mat = as.data.frame(mat)
colnames(mat) = c("yr","trait","sp")
res = table(mat$sp,mat$yr)

Question One:

Multiplying the boolean matrix by 1 will give you something that you can use rowSums() on.

res <- res[which(rowSums(1*(res!=0))>1),]
res

Will give you:

      2000 2001 2002 2003 2004 2005 2006
  sp1    1    2    2    1    1    1    0
  sp3    1    1    1    0    0    1    0
  sp4    1    0    1    0    2    0    1

Question Two:

You can use rle() to detect run lengths.

res <- res[apply(res, 1, function(x) any(rle(x)$lengths > 1)),]
res

Will give you:

      2000 2001 2002 2003 2004 2005 2006
  sp1    1    2    2    1    1    1    0
  sp3    1    1    1    0    0    1    0

Upvotes: 1

Related Questions