Haloom

Reputation: 360

How to filter rows between two specific values

I need help filtering the following dataframe (this is a simple example):

mx = as.data.frame(cbind(c("-", "-", "-", "-", "mutation", "+", "+", "+", "+") ,
                         c(F, T, F, F, F, F, T, F,T)) )
colnames(mx) = c("mutation", "distance")
mx
  mutation distance
1        -    FALSE
2        -     TRUE
3        -    FALSE
4        -    FALSE
5 mutation    FALSE
6        +    FALSE
7        +     TRUE
8        +    FALSE
9        +     TRUE

I need to filter based on the second column (distance), so that it looks like this:

  mutation distance
3        -    FALSE
4        -    FALSE
5 mutation    FALSE
6        +    FALSE

I need to remove every row up to and including the last TRUE that occurs before the row where mx$mutation == "mutation" (so rows 1 and 2), and every row from the first TRUE that occurs after that row onward (so row 7 and beyond).

Upvotes: 6

Views: 1850

Answers (3)

Prem

Reputation: 11955

# sample data (note that I have added a few extra rows at the end)
mx = data.frame(mutation = c("-", "-", "-", "-", "mutation", "+", "+", "+", "+", "-", "mutation", "+","+") ,
                distance = c(F, T, F, F, F, F, T, F,T,F,F,F,T))

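# row numbers of the "mutation" rows and of the TRUE rows
mutation_idx <- which(mx$mutation == "mutation")
distance_T_idx <- which(mx$distance == TRUE)
# for each mutation row, position (within distance_T_idx) of the last TRUE before it
interval_idx <- findInterval(mutation_idx, distance_T_idx)
# rows strictly between that TRUE and the next one
rows <- lapply(interval_idx, function(x) (distance_T_idx[x] + 1):(distance_T_idx[x + 1] - 1))
mx[unlist(rows), ]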

Output is:

   mutation distance
3         -    FALSE
4         -    FALSE
5  mutation    FALSE
6         +    FALSE
10        -    FALSE
11 mutation    FALSE
12        +    FALSE
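
The indexing above relies on every "mutation" row having a TRUE both before and after it. As a rough sketch only, here is one way to relax that: pad the TRUE positions with virtual boundaries just outside the data frame, so a missing TRUE falls back to the first or last row. The helper name rows_around_mutation is made up for illustration, the fallback behaviour is my assumption (the question does not say what should happen in that case), and the code assumes distance is a logical column that is FALSE on the "mutation" rows.

```r
# Hypothetical helper: keep the rows strictly between the TRUEs that surround
# each "mutation" row, treating the edges of the data frame as boundaries.
rows_around_mutation <- function(df) {
  mutation_idx <- which(df$mutation == "mutation")
  true_idx <- which(df$distance)
  # virtual TRUE positions before row 1 and after the last row
  bounds <- c(0L, true_idx, nrow(df) + 1L)
  k <- findInterval(mutation_idx, bounds)
  rows <- lapply(k, function(i) seq(bounds[i] + 1L, bounds[i + 1L] - 1L))
  df[unique(unlist(rows)), ]
}

mx2 <- data.frame(
  mutation = c("-", "-", "-", "-", "mutation", "+", "+", "+", "+",
               "-", "mutation", "+", "+"),
  distance = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
               FALSE, FALSE, FALSE, TRUE)
)
res2 <- rows_around_mutation(mx2)   # rows 3-6 and 10-12, as in the output above

# No TRUE before the mutation row: the result starts at row 1.
edge <- data.frame(mutation = c("-", "mutation", "+", "+"),
                   distance = c(FALSE, FALSE, TRUE, FALSE))
res_edge <- rows_around_mutation(edge)   # rows 1-2
```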

Upvotes: 0

Cybernetic

Reputation: 13334

You can use which() to find the relevant row numbers:

# row number of the last TRUE before the row where mx$mutation == "mutation"
last_true_before_mutation <- max(which(mx$distance == 'TRUE')[which(mx$distance == 'TRUE') < which(mx$mutation == 'mutation')])

# row number of the first TRUE after the row where mx$mutation == "mutation"
first_true_after_mutation <- min(which(mx$distance == 'TRUE')[which(mx$distance == 'TRUE') > which(mx$mutation == 'mutation')])

# all rows to remove
rem_rows <- c(seq_len(last_true_before_mutation), seq(first_true_after_mutation, nrow(mx)))

# remove the appropriate rows
mx[-rem_rows, ]


Here is the same logic wrapped in a reusable function (it assumes the data frame contains exactly one "mutation" row):

before_after_mutation <- function(df) {
    last_true_before_mutation <- max(which(df$distance == 'TRUE')[which(df$distance == 'TRUE') < which(df$mutation == 'mutation')])
    first_true_after_mutation <- min(which(df$distance == 'TRUE')[which(df$distance == 'TRUE') > which(df$mutation == 'mutation')])
    rem_rows <- c(seq_len(last_true_before_mutation), seq(first_true_after_mutation, nrow(df)))
    res <- df[-rem_rows,]
    return(res)
}

Usage:

before_after_mutation(mx)


Upvotes: 0

akrun

Reputation: 887118

We can create a grouping variable by taking the cumulative sum of the logical column ('distance') and then filter within each group:

library(dplyr)
mx %>%
  group_by(grp = cumsum(distance)) %>% 
  filter(any(mutation == "mutation") & !distance) %>%
  ungroup %>% 
  select(-grp)
# A tibble: 4 x 2
# mutation distance
#  <fctr>   <lgl>   
#1 -        F       
#2 -        F       
#3 mutation F       
#4 +        F       
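
To see why the cumulative sum works as a grouping variable, it can help to look at it in base R. This is only an illustrative sketch on the question's example data (with distance stored as a logical); the names grp and keep are mine, not part of the dplyr pipeline.

```r
# Example data from the question, with a logical 'distance' column.
mx <- data.frame(
  mutation = c("-", "-", "-", "-", "mutation", "+", "+", "+", "+"),
  distance = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Each TRUE starts a new group, so the counter steps up at rows 2, 7 and 9.
grp <- cumsum(mx$distance)
grp
# 0 1 1 1 1 1 2 2 3

# Keep the group containing the "mutation" row, dropping the TRUE that opens it.
keep <- grp %in% grp[mx$mutation == "mutation"] & !mx$distance
result <- mx[keep, ]   # rows 3 to 6
```

Rows 2 to 6 share group 1 with the "mutation" row, and the !distance condition drops row 2, the TRUE that opened the group.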

NOTE: We can create the data.frame directly with data.frame(). There is no need for cbind, and using it here adversely affects the column types: cbind converts its input to a matrix, a matrix can hold only a single type, and so the logical 'distance' values are coerced to character.
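
A small check of that coercion, as a sketch on a shortened three-row version of the data (the object names m, via_cbind and direct are just for illustration):

```r
# cbind() on a character and a logical vector yields a character matrix,
# so the logical values become the strings "TRUE" / "FALSE".
m <- cbind(c("-", "mutation", "+"), c(FALSE, TRUE, FALSE))
typeof(m)                       # "character"

via_cbind <- as.data.frame(m)
is.logical(via_cbind[[2]])      # FALSE: no longer a logical column

# data.frame() keeps each column's own type.
direct <- data.frame(mutation = c("-", "mutation", "+"),
                     distance = c(FALSE, TRUE, FALSE))
is.logical(direct$distance)     # TRUE
```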

data

mx = data.frame(mutation = c("-", "-", "-", "-", "mutation", "+", "+", "+", "+"),
                distance = c(F, T, F, F, F, F, T, F, T))

Upvotes: 2
