user2543095
user2543095

Reputation: 107

Subsetting data between specific, identifiable rows

let's say I have a dataset like this, in this example with three columns. There are rows with texts, and in between such rows are rows with numbers (the data that I want). If I want the data between two specific text rows, how would you identify the specific rows, and subset the data between the rows? E.g., let's say I wanted the data between row with c's and row with d's.

col1 col2 col3
a    a    a
a    a    a
1    1    1
2    2    2
3    3    3
b    b    b
3    3    3
c    c    c
4    4    4
5    5    5
d    d    d

into

4    4    4
5    5    5

Upvotes: 0

Views: 81

Answers (1)

akrun
akrun

Reputation: 887981

If you need to subset the data based on the identical elements in all the columns, one option would be to paste the rows with interaction, use grep to find the index of rows that match the string, get the sequence of rows between the index rows (:) and subset.

indx1 <-  as.character(interaction(df, sep=''))
indx2 <- grep('ccc|ddd', indx1)
df[(indx2[1]+1):(indx2[2]-1),]
#   col1 col2 col3
#9     4    4    4
#10    5    5    5

In the example provided, there is only instance where this occur. Suppose, you have multiple instances, you could get the corresponding sequences between index of ccc, ddd strings with Map, subset, and rbind the list elements.

 indx1 <-  as.character(interaction(df1, sep=''))
 do.call(rbind,Map(function(x,y) df1[(x+1):(y-1),],
                   grep('ccc', indx1), grep('ddd', indx1)))

Update

I realized that there might be cases in the dataset where the "ccc" rows may not have a corresponding "ddd" row or there are multiple "ddd" rows etc. In that case create an index ("indx2") by converting the dataset to a logical matrix (df2!='c'). Sum each rows (rowSums). The result wil be a numeric vector. Here '0' indicates the row have all 'c's. Negate the index (!) to convert "0" values to "TRUE" and cumsum. Split the dataset (split) to a list after rows before the first ccc row. Check whether there are all "d" (!rowSums(x!='d')) rows in each list element. If there is any row (if(any(..), then subset the dataset (x) based on the index (2:(which(i1)[1]-1)), finally rbind the list elements.

 indx2 <- cumsum(!rowSums(df2!='c'))
 lst <- split(df2[indx2!=0,], indx2[!!indx2])
 res <- do.call(rbind,lapply(lst, function(x) {
              i1 <- !rowSums(x!='d')
              if(any(i1)) x[2:(which(i1)[1]-1),]}))

data

 df <- structure(list(col1 = c("a", "a", "1", "2", "3", "b", "3", "c", 
 "4", "5", "d"), col2 = c("a", "a", "1", "2", "3", "b", "3", "c", 
 "4", "5", "d"), col3 = c("a", "a", "1", "2", "3", "b", "3", "c", 
  "4", "5", "d")), .Names = c("col1", "col2", "col3"), class = 
 "data.frame", row.names = c(NA, -11L))

 df1 <- structure(list(col1 = c("a", "a", "1", "2", "3", "b", "3",
  "c", "4", "5", "d", "3", "b", "3", "c", "4", "5", "d"), col2 = 
  c("a", "a", "1", "2", "3", "b", "3", "c", "4", "5", "d", "3", "b", 
  "3", "c", "4", "5", "d"), col3 = c("a", "a", "1", "2", "3", "b", 
  "3", "c", "4", "5", "d", "3", "b", "3", "c", "4", "5", "d")), .Names 
  = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA, 
  -18L))

 df2 <- structure(list(col1 = c("a", "a", "1", "2", "3", "b", "3", "c", 
 "4", "5", "d", "3", "b", "3", "c", "4", "5", "d", "4", "5", "d", 
 "c", "4", "5"), col2 = c("a", "a", "1", "2", "3", "b", "3", "c", 
 "4", "5", "d", "3", "b", "3", "c", "4", "5", "d", "4", "5", "d", 
 "c", "4", "5"), col3 = c("a", "a", "1", "2", "3", "b", "3", "c", 
 "4", "5", "d", "3", "b", "3", "c", "4", "5", "d", "4", "5", "d", 
 "c", "4", "5")), .Names = c("col1", "col2", "col3"), class =  
"data.frame", row.names = c(NA, -24L))

Upvotes: 6

Related Questions