Vykta Wakandigara
Vykta Wakandigara

Reputation: 349

How can I get the names of rows in a data.table satisfying a regex condition?

Let's say I have a data.table as follows:

data=data.table(dates=c('04 SEPTEMBER 2018', '05 APR 2018', '7/10/2018'), DT = c('21/07/2010', '3 04 2018', '16 DEC 2018'), amounts = c(21,37,49))

And I would like the names of rows in a data.table satisfying the following grep condition

grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}",x) 

so that I can find rows with dates fulfilling specific date formats. As you can see, both columns dates and DT contain date formats. I do not want to refer to a specific column in the code. I would just like to pick rows containing characters matching the above grep pattern.

I expect the correct code to return c(2,3) since those rows contain the required strings. How can I get the answer that I expect?

Upvotes: 1

Views: 552

Answers (2)

JdeMello
JdeMello

Reputation: 1718

Use lapply() to identity which columns matches the regexpr. Since lapply() outputs a list, use Reduce(`|`, ...) to verify whether at least one of the columns per row satisfy the regexpr condition:

data[
  data[, Reduce(`|`, lapply(.SD, function(x) grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}",x)))], 
  ]

Result:

> data[
+   data[, Reduce(`|`, lapply(.SD, function(x) grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}",x)))], 
+   ]
         dates amounts
1: 05 APR 2018      37
2: 16 DEC 2018      49

Update

To get the indices of the rows for which there is a match (for any column): Use sapply() to get a matrix with the cells indicating whether the match was successful. Then use rowSums(...) > 0 to consolidate it into a single vector with values of TRUE (at least one cell in the row has a successful match) or FALSE (no matches at all). Wrap it all in which() to display the row indices.

Result:

> which(rowSums(sapply(data,function(x) grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}",x))) > 0)
[1] 2 3

Upvotes: 3

Wimpel
Wimpel

Reputation: 27732

an other data.table approach:

ans <- melt( transpose(data), 
             id.vars = NULL, 
             measure.vars = patterns("^V"), 
             variable.name = "row" )
ans <- ans[ grepl("^[0-9]{2}\\s[A-Z]{3}\\s[0-9]{4}", value), ]
ans[, row := gsub("V", "", row)]
    row       value
 1:   2 05 APR 2018
 2:   3 16 DEC 2018

update

data[ seq.int(.N) %in% ans$row, ]
          dates          DT amounts
 1: 05 APR 2018   3 04 2018      37
 2:   7/10/2018 16 DEC 2018      49

Upvotes: 2

Related Questions