Reputation: 349
Let's say I have a data.table as follows:
data=data.table(dates=c('04 SEPTEMBER 2018', '05 APR 2018', '7/10/2018'), DT = c('21/07/2010', '3 04 2018', '16 DEC 2018'), amounts = c(21,37,49))
And I would like the names of rows in a data.table satisfying the following grep condition
grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}",x)
so that I can find rows with dates fulfilling specific date formats. As you can see, both columns dates
and DT
contain date formats. I do not want to refer to a specific column in the code. I would just like to pick rows containing characters matching the above grep pattern.
I expect the correct code to return c(2,3)
since those rows contain the required strings. How can I get the answer that I expect?
Upvotes: 1
Views: 552
Reputation: 1718
Use lapply()
to identity which columns matches the regexpr. Since lapply()
outputs a list, use Reduce(`|`, ...)
to verify whether at least one of the columns per row satisfy the regexpr condition:
data[
data[, Reduce(`|`, lapply(.SD, function(x) grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}",x)))],
]
Result:
> data[
+ data[, Reduce(`|`, lapply(.SD, function(x) grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}",x)))],
+ ]
dates amounts
1: 05 APR 2018 37
2: 16 DEC 2018 49
To get the indices of the rows for which there is a match (for any column): Use sapply()
to get a matrix with the cells indicating whether the match was successful. Then use rowSums(...) > 0
to consolidate it into a single vector with values of TRUE
(at least one cell in the row has a successful match) or FALSE
(no matches at all). Wrap it all in which()
to display the row indices.
Result:
> which(rowSums(sapply(data,function(x) grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}",x))) > 0)
[1] 2 3
Upvotes: 3
Reputation: 27732
an other data.table approach:
ans <- melt( transpose(data),
id.vars = NULL,
measure.vars = patterns("^V"),
variable.name = "row" )
ans <- ans[ grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}", value), ]
ans[, row := gsub("V", "", row)]
row value 1: 2 05 APR 2018 2: 3 16 DEC 2018
update
data[ seq.int(.N) %in% ans$row, ]
dates DT amounts 1: 05 APR 2018 3 04 2018 37 2: 7/10/2018 16 DEC 2018 49
Upvotes: 2