Reputation: 107
let's say I have a dataset like this, in this example with three columns. There are rows with texts, and in between such rows are rows with numbers (the data that I want). If I want the data between two specific text rows, how would you identify the specific rows, and subset the data between the rows? E.g., let's say I wanted the data between row with c's and row with d's.
col1 col2 col3
a a a
a a a
1 1 1
2 2 2
3 3 3
b b b
3 3 3
c c c
4 4 4
5 5 5
d d d
into
4 4 4
5 5 5
Upvotes: 0
Views: 81
Reputation: 887981
If you need to subset the data based on the identical elements in all the columns, one option would be to paste the rows with interaction
, use grep
to find the index of rows that match the string, get the sequence of rows between the index rows (:
) and subset.
indx1 <- as.character(interaction(df, sep=''))
indx2 <- grep('ccc|ddd', indx1)
df[(indx2[1]+1):(indx2[2]-1),]
# col1 col2 col3
#9 4 4 4
#10 5 5 5
In the example provided, there is only instance where this occur. Suppose, you have multiple instances, you could get the corresponding sequences between index of ccc
, ddd
strings with Map
, subset, and rbind
the list elements.
indx1 <- as.character(interaction(df1, sep=''))
do.call(rbind,Map(function(x,y) df1[(x+1):(y-1),],
grep('ccc', indx1), grep('ddd', indx1)))
I realized that there might be cases in the dataset where the "ccc" rows may not have a corresponding "ddd" row or there are multiple "ddd" rows etc. In that case create an index ("indx2") by converting the dataset to a logical matrix (df2!='c'
). Sum each rows (rowSums
). The result wil be a numeric vector. Here '0' indicates the row have all 'c's. Negate the index (!
) to convert "0" values to "TRUE" and cumsum
. Split the dataset (split
) to a list after rows before the first ccc
row. Check whether there are all "d" (!rowSums(x!='d')
) rows in each list element. If there is any row (if(any(..
), then subset the dataset (x
) based on the index (2:(which(i1)[1]-1)
), finally rbind
the list elements.
indx2 <- cumsum(!rowSums(df2!='c'))
lst <- split(df2[indx2!=0,], indx2[!!indx2])
res <- do.call(rbind,lapply(lst, function(x) {
i1 <- !rowSums(x!='d')
if(any(i1)) x[2:(which(i1)[1]-1),]}))
df <- structure(list(col1 = c("a", "a", "1", "2", "3", "b", "3", "c",
"4", "5", "d"), col2 = c("a", "a", "1", "2", "3", "b", "3", "c",
"4", "5", "d"), col3 = c("a", "a", "1", "2", "3", "b", "3", "c",
"4", "5", "d")), .Names = c("col1", "col2", "col3"), class =
"data.frame", row.names = c(NA, -11L))
df1 <- structure(list(col1 = c("a", "a", "1", "2", "3", "b", "3",
"c", "4", "5", "d", "3", "b", "3", "c", "4", "5", "d"), col2 =
c("a", "a", "1", "2", "3", "b", "3", "c", "4", "5", "d", "3", "b",
"3", "c", "4", "5", "d"), col3 = c("a", "a", "1", "2", "3", "b",
"3", "c", "4", "5", "d", "3", "b", "3", "c", "4", "5", "d")), .Names
= c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-18L))
df2 <- structure(list(col1 = c("a", "a", "1", "2", "3", "b", "3", "c",
"4", "5", "d", "3", "b", "3", "c", "4", "5", "d", "4", "5", "d",
"c", "4", "5"), col2 = c("a", "a", "1", "2", "3", "b", "3", "c",
"4", "5", "d", "3", "b", "3", "c", "4", "5", "d", "4", "5", "d",
"c", "4", "5"), col3 = c("a", "a", "1", "2", "3", "b", "3", "c",
"4", "5", "d", "3", "b", "3", "c", "4", "5", "d", "4", "5", "d",
"c", "4", "5")), .Names = c("col1", "col2", "col3"), class =
"data.frame", row.names = c(NA, -24L))
Upvotes: 6