Pokey
Pokey

Reputation: 161

Regular expression to get all rows containing a value within a data.table?

I want to check a data.table for all dates that have a given category, and set all those dates to TRUE in a separate data.table. I have a data.table that looks something like this:

library(data.table)
DT1 = fread(
'Date          Category
2010-01-01    A
2010-01-01    B
2010-01-02    A
2010-01-02    C
2010-01-02    D
2010-01-04    B
2010-01-04    "B OR D"')
DT1[, Date := as.IDate(Date) ]

and I'm trying to extract the values to a data frame that looks like this:

DT2 = fread('
Date       A     B     C     D
2010-01-01 FALSE FALSE FALSE FALSE
2010-01-02 FALSE FALSE FALSE FALSE
2010-01-03 FALSE FALSE FALSE FALSE
2010-01-04 FALSE FALSE FALSE FALSE
2010-01-05 FALSE FALSE FALSE FALSE')
DT2[, Date := as.IDate(Date) ]

If a particular category exists for a given date, I would like to set the cell with the row's date and the category's column to true. I know this will probably involve something like:

DT2 <- DT2[Date %in% DT1$Date, A := grep(something)]

I want this call to modify DT2 to look like this:

Date       A     B     C     D
2010-01-01 TRUE  FALSE FALSE FALSE
2010-01-02 TRUE  FALSE FALSE FALSE
2010-01-03 FALSE FALSE FALSE FALSE
2010-01-04 FALSE FALSE FALSE FALSE
2010-01-05 FALSE FALSE FALSE FALSE

Right now, my code uses any(grep()), which is a problem because it changes DT2$A[4] to TRUE, which I don't want. I would like to use a regular expression, since some of my category cells refer to multiple values. Because of the last cell, both my checks on the value B and the value D should set their dummy variables in 2010-01-04 to true. As the last 2 rows show, there may be one date which contains multiple references to a single category.

Is there a way to do this in data.table? I can use other packages if absolutely necessary.

Note that I can't just use dcast to transform D1 into the proper data.table, because it doesn't contain all the dates I need.

Upvotes: 2

Views: 538

Answers (1)

thelatemail
thelatemail

Reputation: 93813

Here's one attempt created by looping over the possible categories (hat-tip to @Frank for adjustments):

DF1[
  .(seq(min(Date), max(Date) + 1L, by="day")),
  on=.(Date),
  lapply(sapply(c("A","B","C","D"), grepl, x=Category, USE.NAMES=TRUE, simplify=FALSE),any),
  by=.EACHI
]
#         Date     A     B     C     D
#1: 2010-01-01  TRUE  TRUE FALSE FALSE
#2: 2010-01-02  TRUE FALSE  TRUE  TRUE
#3: 2010-01-03 FALSE FALSE FALSE FALSE
#4: 2010-01-04 FALSE  TRUE FALSE  TRUE
#5: 2010-01-05 FALSE FALSE FALSE FALSE

Upvotes: 2

Related Questions