EJJ
EJJ

Reputation: 1513

Automated grep() across multiple columns in large dataset in R

EDIT Reproducible example at the bottom...

I am working with a large dataset (pooled NHAMCS from the CDC):

> dim(ed0509) [1] 174020 514

I'm having trouble using grep() to identify rows in a data frame based on patterns in multiple column variables DIAG1 DIAG2 DIAG3 based on a vector list of interest SSTI.list. The condition is that if this pattern is identified in either one of column variables, then I want to pull that row number out to ultimately use this to subset the data to create a new categorical column SSTI.cat in the dataset (0 or 1).

SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684","685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", "7714", "7715", "7854", "9583", "99662", "99762", "9985")

Since I am dealing with a pretty long list >1000s of elements, I'm trying to automate this process using a for loop. The desired output is having new variables that contain the list of rows for each value in the vector SSTI.list. I have mainly having issues running a for loop within grep() and I get the error:

argument 'pattern' has length > 1 and only the first element will be used

What I have tried to do so far is:

diags <- c(ed0509$DIAG1,ed0509$DIAG2,ed0509$DIAG3)

for (i in SSTI.list){ assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F)) }

SSTI.comb would be the final list of rows (all of vari) that identified the patterns in SSTI.list from the for loop that would be used to create the categorical variable SSTI.cat

Then used the data.table package to create the categorical variable.

SSTI.comb<-sort(as.numeric(SSTI.comb))

setDT(ed0509)[SSTI.comb,SSTI.cat:=1][,SSTI.cat:=0]

EDIT for reproducibility, sorry about that...

DIAG1=c("00000","4659-","0356-","5664-","771--","7715-","78791")
DIAG2=c("3829-","00000","00000","4659-","7854-","00000","566--")
DIAG3=c("9985-","00000","00000","00000","00000","00000","00000")
df<-data.frame(DIAG1,DIAG2,DIAG3)`

SSTI.list <- c("035","9985","7854","771","7715")

for (i in SSTI.list){
assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F))
}

Conceptually I would like to have an output where the new column variable attached to df would indicate that the 1st, 3rd, 5th and 6th rows are identified to satisfy the pattern indicated in SSTI.list

DIAG1 DIAG2 DIAG3 SSTI.cat
1 00000 3829- 9985-        1
2 4659- 00000 00000        0
3 0356- 00000 00000        1
4 5664- 4659- 00000        0
5 771-- 7854- 00000        1
6 7715- 00000 00000        1
7 78791 566-- 00000        0

Upvotes: 1

Views: 3195

Answers (1)

eipi10
eipi10

Reputation: 93851

Here's an example with fake data that I cooked up before you added your data. Let me know if this is what you had in mind:

SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684",
               "685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", 
               "7714", "7715", "7854", "9583", "99662", "99762", "9985")

# Fake data
set.seed(10)
dat = as.data.frame(replicate(5, sample(c(SSTI.list, 1e5:(1e5+1000)),10)), stringsAsFactors=FALSE)
       V1     V2     V3     V4     V5
1  100493 100642 100861 100522 100254
2  100286 100555 100604 100066 100206
3  100409 100087 100767 100145   7048
4  100682 100583 100336 100895 100719
5  100058 100338 100387 100404 100227
6  100202 100410 100695 100737 100136
7  100252 100024 100829 100813   7078
8  100249 100241 100216 100947 100468
9  100600 100378 100758 100671 100076
10 100998 100824 100334 100482 100789
# Match any instance of a pattern within any element of the data
dat[apply(dat, 1, function(i) any(grepl(paste(SSTI.list, collapse="|"), i))),]
      V1     V2     V3     V4     V5
3 100409 100087 100767 100145   7048
4 100682 100583 100336 100895 100719  # "100682 matches "682" in SSTI.list
7 100252 100024 100829 100813   7078
# Match only if a data element is exactly the same as one of the patterns.
dat[apply(dat, 1, function(i) any(grepl(paste(paste0("^",SSTI.list,"$"), collapse="|"), i))),]
      V1     V2     V3     V4   V5
3 100409 100087 100767 100145 7048
7 100252 100024 100829 100813 7078

If you just want the row indices of matching rows:

which(apply(dat, 1, function(i) any(grepl(paste(SSTI.list, collapse="|"), i))))

[1] 3 4 7

Upvotes: 2

Related Questions