Reputation: 3555
I have a matrix of elements containing years, traits and species. But I want to select only certain species that are present a certain consecutive number of year. I can see this by looking at a table:
mat = matrix(c(2000,2001,2001,2002,2002,2003, 2004, 2005,
2001,
2000, 2001, 2002, 2005,
2000, 2002, 2004, 2004, 2006,
2,3,3,4,1,2,2,2,
2,
1,4,4,3,
1,4,4,3,2,
"sp1","sp1","sp1","sp1","sp1","sp1","sp1","sp1",
"sp2",
"sp3","sp3","sp3","sp3",
"sp4","sp4","sp4","sp4","sp4"), nrow = 18)
mat = as.data.frame(mat)
colnames(mat) = c("yr","trait","sp")
res = table(mat$sp,mat$yr)
Here the table looks like this:
2000 2001 2002 2003 2004 2005 2006
sp1 1 2 2 1 1 1 0
sp2 0 1 0 0 0 0 0
sp3 1 1 1 0 0 1 0
sp4 1 0 1 0 2 0 1
But here, I want to remove sp2 from my analysis since it was seen only once in 2001 and no other years. Is there a way to do this? I've tried this, but it prints the exact same table:
res[apply(res,1,function(z) any(z==0)),]
In the end I'd like to delete sp2 from the ´mat´ data, but using the information in the table to delete sp2.
2000 2001 2002 2003 2004 2005 2006
sp1 1 2 2 1 1 1 0
sp3 1 1 1 0 0 1 0
sp4 1 0 1 0 2 0 1
And the ´mat´ would look like:
yr trait sp
1 2000 2 sp1
2 2001 3 sp1
3 2001 3 sp1
4 2002 4 sp1
5 2002 1 sp1
6 2003 2 sp1
7 2004 2 sp1
8 2005 2 sp1
10 2000 1 sp3
11 2001 4 sp3
12 2002 4 sp3
13 2005 3 sp3
14 2000 1 sp4
15 2002 4 sp4
16 2004 4 sp4
17 2004 3 sp4
18 2006 2 sp4
Also, I want a second command that would allow me to select the individuals from ´mat´ that are seen consecutively in 2 or more years (this would delete sp4, since it was seen at even years only).
Again, I've tried this, but it's not deleting the right information:
mat[which(res != 0),]
The end result would be:
2000 2001 2002 2003 2004 2005 2006
sp1 1 2 2 1 1 1 0
sp3 1 1 1 0 0 1 0
And the ´mat´ would look like:
yr trait sp
1 2000 2 sp1
2 2001 3 sp1
3 2001 3 sp1
4 2002 4 sp1
5 2002 1 sp1
6 2003 2 sp1
7 2004 2 sp1
8 2005 2 sp1
10 2000 1 sp3
11 2001 4 sp3
12 2002 4 sp3
13 2005 3 sp3
This would be applied to a much larger dataset. This is just a small example.
Upvotes: 2
Views: 1893
Reputation: 3875
I don't think you need the res table
to perform the filtering on your dataframe mat
, you can do it directly using dplyr
.
In order to filter out a given sp
that appears only one year, you can do:
library(dplyr)
mat %>% group_by(yr) %>% group_by(sp) %>% filter(n_distinct(yr)>1) %>% ungroup()
yr trait sp
<fctr> <fctr> <fctr>
1 2000 2 sp1
2 2001 3 sp1
3 2001 3 sp1
4 2002 4 sp1
5 2002 1 sp1
6 2003 2 sp1
7 2004 2 sp1
8 2005 2 sp1
9 2000 1 sp3
10 2001 4 sp3
11 2002 4 sp3
12 2005 3 sp3
13 2000 1 sp4
14 2002 4 sp4
15 2004 4 sp4
16 2004 3 sp4
17 2006 2 sp4
In order to filter out the sps that don't appear two consecutive years, you can do:
mat %>% group_by(sp)%>% filter(min(diff(sort(unique(yr))))==1)
This returns
yr trait sp
<dbl> <fctr> <chr>
1 1 2 sp1
2 2 3 sp1
3 2 3 sp1
4 3 4 sp1
5 3 1 sp1
6 4 2 sp1
7 5 2 sp1
8 6 2 sp1
9 1 1 sp3
10 2 4 sp3
11 3 4 sp3
12 6 3 sp3
Note that this last operation returns a warning as sp2 only has one year. You can combine the two above operations:
mat %>% group_by(yr) %>% group_by(sp) %>% filter(n_distinct(yr)>1) %>% ungroup() %>% group_by(sp)%>% filter(min(diff(sort(unique(yr))))==1)
Which doesn't return the warning.
EDIT: In case you want to filter according to a specified number of consecutive years (not just 2), you could do:
## This function returns the max number of consecutive 1s +1 in a vector, and 0 if there are none or there is just one value in the vector
consec1=function(x){ifelse((1 %in% x),max(rle(x)$lengths[rle(x)$values==1])+1,0)}
## Then use it in your dplyr::filter
mat %>% group_by(sp) %>% filter(consec1(diff(sort(unique(yr))))==6)
Which returns:
yr trait sp
<dbl> <fctr> <fctr>
1 2000 2 sp1
2 2001 3 sp1
3 2001 3 sp1
4 2002 4 sp1
5 2002 1 sp1
6 2003 2 sp1
7 2004 2 sp1
8 2005 2 sp1
Upvotes: 2
Reputation: 1795
I am a bit reluctant but:
find_zeros<-function(vec){
bool<-grepl("1{3}",paste(ifelse(vec==0,1,0),collapse = ""),perl = T)
return(bool)
}
res[!apply(res,1,find_zeros),]
yielding in the:
2000 2001 2002 2003 2004 2005 2006
sp1 1 2 2 1 1 1 0
sp3 1 1 1 0 0 1 0
sp4 1 0 1 0 2 0 1
Of course in order to get the trimmed mat
one should use:
mat_trimmed<-mat[(mat$sp %in% row.names(final)),]
Upvotes: 0
Reputation: 889
Replicating your data:
mat = matrix(c(2000,2001,2001,2002,2002,2003, 2004, 2005,
2001,
2000, 2001, 2002, 2005,
2000, 2002, 2004, 2004, 2006,
2,3,3,4,1,2,2,2,
2,
1,4,4,3,
1,4,4,3,2,
"sp1","sp1","sp1","sp1","sp1","sp1","sp1","sp1",
"sp2",
"sp3","sp3","sp3","sp3",
"sp4","sp4","sp4","sp4","sp4"), nrow = 18)
mat = as.data.frame(mat)
colnames(mat) = c("yr","trait","sp")
res = table(mat$sp,mat$yr)
Multiplying the boolean matrix by 1 will give you something that you can use rowSums()
on.
res <- res[which(rowSums(1*(res!=0))>1),]
res
Will give you:
2000 2001 2002 2003 2004 2005 2006
sp1 1 2 2 1 1 1 0
sp3 1 1 1 0 0 1 0
sp4 1 0 1 0 2 0 1
You can use rle()
to detect run lengths.
res <- res[apply(res, 1, function(x) any(rle(x)$lengths > 1)),]
res
Will give you:
2000 2001 2002 2003 2004 2005 2006
sp1 1 2 2 1 1 1 0
sp3 1 1 1 0 0 1 0
Upvotes: 1