Reputation: 23
I am trying to find a method for subsetting or slicing a dataframe based on each occurrence of a certain string appearing in one column/variable - e.g. I would like to delete all rows between two occurrences of the string. This problem is similar to this question BUT the crucial difference is that I have multiple occurrences of the string and would like to delete the rows between each pair of occurrences. I'm an R dunce and I can't find a way to apply the solution to an index of more than two integers in any elegant kind of way.
Say I have the following dataframe:
a <- c("one", "here is a string", "two", "three", "four", "another string", "five", "six", "yet another string", "seven", "last string")
b <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
c <- c("type1", "type1", "type1", "type1", "type1", "type1", "type2", "type2", "type2", "type2", "type2")
df <- data.frame(a,b,c)
Which gives the following:
print(df)
                    a b     c
1                 one a type1
2    here is a string b type1
3                 two c type1
4               three d type1
5                four e type1
6      another string f type1
7                five g type2
8                 six h type2
9  yet another string i type2
10              seven j type2
11        last string k type2
I would like to subset it so that every row between each pair of occurrences of the string 'string', including the matching rows themselves, is removed:
     a b     c
1  one a type1
7 five g type2
8  six h type2
Using the solution accepted in the question I've linked to, I can remove the first set of rows by creating an index of row numbers and using the first two positions in the index:
index <- grep("string", df$a)
df[-(index[1]:index[2]), ]
But I also want to remove the rows between the next pair of integers in my index:
df[-(index[3]:index[4]), ]
My actual index has 128 integers (64 'pairs') so manually extracting the rows as I've done above will be a pain in the neck. My current plan if I can't find an elegant solution is to print the index and manually extract the rows (which, tbh, would probably have been faster than writing this question but would look awful and wouldn't teach me anything):
print(index)
[1] 2 6 9 11
df[-c(2:6, 9:11), ]
Is there a way to loop over each consecutive pair of integers in the index, or another way of doing what I'm trying to do? I'm not a hugely experienced R user and I have scoured SO for what I'm trying to do before creating this example (which I hope adheres to reprex standards; this is the first time I've asked a question).
I have included column 'c' in the reprex, because it reflects the structure of my actual data (one pair of 'string' occurrences in column 'a' for each change in observation for column 'c') and I'm wondering if there's a way to use group_by() with a base sub-setting expression?? But this could be a total red herring; just including it in case it helps.
Upvotes: 2
Views: 471
Reputation: 41
Since I already posted it on twitter, here's a tidyverse-y solution:
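library(dplyr)
df %>%
  mutate(stringy = grepl("string", a),  # TRUE on rows that contain "string"
         seq = cumsum(stringy)) %>%     # running count of matches so far
  filter(seq %% 2 == 0, !stringy)       # keep rows outside a pair, drop the closing matches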
The trick is basically the same: find the rows that contain the string you're looking for, then create a way to alternate between them (here, a running count with cumsum, then modulo 2). The filter keeps the rows where that count is even and drops any remaining rows that contain the string themselves (those are the closing matches of each pair).
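To see why the modulo filter works, it can help to look at the intermediate values. Below is a small base R sketch (no dplyr needed) that uses column a from the question; the names `seq_id`, `keep`, `steps` and `kept_rows` are made up for illustration.

```r
# Column `a` from the question's example data
a <- c("one", "here is a string", "two", "three", "four", "another string",
       "five", "six", "yet another string", "seven", "last string")

stringy <- grepl("string", a)            # TRUE where the marker string occurs
seq_id  <- cumsum(stringy)               # odd from an opening match up to the row before the closing one
keep    <- seq_id %% 2 == 0 & !stringy   # even count and not a marker row itself

# Side-by-side view of each step
steps <- data.frame(a, stringy, seq_id, keep)
print(steps)

# Row numbers that survive the filter
kept_rows <- which(keep)
print(kept_rows)
```

The closing match of each pair already has an even count, which is why the extra `!stringy` condition is needed to drop it.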
Upvotes: 1
Reputation: 389235
Create a sequence between each consecutive pair of values in index using Map and remove those rows. One way to get consecutive pairs is to subset with alternating logical values.
df[-unlist(Map(`:`, index[c(TRUE, FALSE)], index[c(FALSE, TRUE)])),]
#     a b     c
#1  one a type1
#7 five g type2
#8  six h type2
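Broken into steps, the one-liner looks roughly like the sketch below. It rebuilds the question's df so that it runs on its own. The names `starts`, `ends` and `drop_rows` are illustrative, and the check for an even number of matches is an added assumption that is not part of the original answer: with an odd number of matches, the recycling would silently mispair the indices.

```r
# Rebuild the example data frame from the question
df <- data.frame(
  a = c("one", "here is a string", "two", "three", "four", "another string",
        "five", "six", "yet another string", "seven", "last string"),
  b = letters[1:11],
  c = rep(c("type1", "type2"), times = c(6, 5))
)

index <- grep("string", df$a)        # positions of the marker rows: 2 6 9 11

# Assumption: markers always come in opening/closing pairs
stopifnot(length(index) %% 2 == 0)

starts <- index[c(TRUE, FALSE)]      # 1st, 3rd, ... match (logical index is recycled)
ends   <- index[c(FALSE, TRUE)]      # 2nd, 4th, ... match

# Build start:end for every pair and flatten into one vector of row numbers
drop_rows <- unlist(Map(`:`, starts, ends))

result <- df[-drop_rows, ]
print(result)
```

Note that if there are no matches at all, `drop_rows` is empty and negative indexing with an empty vector returns zero rows rather than the whole data frame, so that case would need its own check.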
Upvotes: 2