Luc
Luc

Reputation: 958

R extract first cell by row that meets a criteria

The columns in my data frame are ordered by relevance, the left column being the most relevant one. I am trying to extract the most relevant item that starts with 'D'.

Here is an example:

df <- structure(list(TDIAG1 = structure(c(7L, 2L, 6L, 8L, 4L, 1L, 5L, 
5L, 9L, 3L), .Label = c("D123", "D127", "E611", "E1133", "H269", 
"K701", "K704", "K922", "R0989"), class = "factor"), TDIAG2 = structure(c(7L, 
6L, 5L, 2L, 3L, 6L, 4L, 4L, 1L, 1L), .Label = c("", "D649", "H431", 
"H570", "K703", "D123", "R18"), class = "factor"), TDIAG3 = structure(c(2L, 
6L, 5L, 4L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F102", "H333", 
"K296", "K658", "Z720"), class = "factor"), TDIAG4 = structure(c(2L, 
1L, 4L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "E834", "K703", 
"K766"), class = "factor"), TDIAG5 = structure(c(1L, 1L, 3L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F101", "F102"), class = "factor"), 
    TDIAG6 = structure(c(1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = c("", "E877", "Z720"), class = "factor")), .Names = c("TDIAG1", 
"TDIAG2", "TDIAG3", "TDIAG4", "TDIAG5", "TDIAG6"), row.names = c(NA, 
10L), class = "data.frame")


    > df
   TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
1    K704    R18   F102   E834              
2    D127   D123   Z720                     
3    K701   K703   K658   K766   F102   E877
4    K922   D649   K296   K703   F101   Z720
5   E1133   H431   H333                     
6    D123   D123                            
7    H269   H570                            
8    H269   H570                            
9   R0989                                   
10   E611 

The resulting vector should report NA when there is no match, and the first (most left) item when there is a match. I can find the items I am interested in... however, I get stuck in extracting the first (most left) for each row.

> sapply(df,  function (x) grepl("D", x))
      TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
 [1,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [2,]   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE
 [3,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [4,]  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE
 [5,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [6,]   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE
 [7,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [8,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [9,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
[10,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE

The results should be:

c(NA,"D127", NA, "D649", NA, "D123", NA, NA, NA, NA)

EDIT: How would this be extended if i wanted the pattern to be c("D", "K")? I get an error saying that it only takes the first one. (Answer: change pattern to "D|K")

EDIT 2: Also, how do I extent this when i want to find the most left "D" code per row but exclude codes from a pre-specified list (eg. exclude c("D123", "D090", "D111")?

EDIT 3: I have written a small function that includes all answers. It works well for what I am doing. Perhaps it might benefit someone else at some stage.

Function:

FLAG <- function(data, tomatch, Exact.tomatch=T, Exclude=NA,  Exact.excl=T, Return=c("01", "FirstValue", "Count")){ 
  if(Exact.tomatch == T){tomatch <-paste("^",tomatch,"$", sep="")}
  if(length(tomatch) > 1){tomatch <- paste(tomatch, collapse="|")}
  if(Exact.excl==F){Exclude <- paste(Exclude, collapse="|")}

  out <- NA
  if(is.na(Exclude[1])==T){hits <- vapply(data, grepl, logical(nrow(data)), pattern = tomatch)}
  if(is.na(Exclude[1])!=T & Exact.excl==T){hits <- vapply(lapply(data, function(x) replace(x,x %in% Exclude, NA)), grepl, logical(nrow(data)), pattern = tomatch)}
  if(is.na(Exclude[1])!=T & Exact.excl==F){hits <- vapply(replace(data, vapply(data, grepl, logical(nrow(data)), pattern = Exclude)==T, NA), grepl, logical(nrow(data)), pattern = tomatch)}
  if(Return[1] == "01"){out <- replace(rowSums(hits), rowSums(hits) >1, 1)}
  if(Return[1] == "Count"){out <- rowSums(hits)}
  if(Return[1] == "FirstValue"){out <- data[cbind(seq_len(nrow(data)),replace(max.col(hits,"first"), rowSums(hits)==0, NA))]}
  out
}

The function needs data frame or list as input. Then a vector of what to look for, what to exclude and whether these should be an exact match or not. Finally, it can return the first (most left) match, count of all matches, or just a flag if any of the resulting matches were found.

Example 1. Look in df for any code starting with D or K (do not restrict to exact matching), but exclude K701, K703 and D127 (exact match on these), and return the first (most left) value:

FLAG(data=df, tomatch=c("D", "K"), Exact.tomatch=F, Exclude=c("K701", "K703","D127"),  Exact.excl=T, Return="FirstValue")

Example 2. Look in df for any code starting with D or H (do not restrict to exact matching), but exclude any code that includes H3 (none exact matching), and return the first (most left) value:

FLAG(data=df, tomatch=c("D", "H"), Exact.tomatch=F, Exclude=c("H3"),  Exact.excl=F, Return="FirstValue")

Upvotes: 2

Views: 817

Answers (2)

thelatemail
thelatemail

Reputation: 93938

No need to run across each row necessarily. Running grepl nrow(df) times will be probably much slower than vapply (or even it's slower cousin sapply) running ncol(df) times. E.g.:

hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
  seq_len(nrow(df)),
  replace(max.col(hits,"first"), rowSums(hits)==0, NA)
  )]
#[1] NA     "D127" NA     "D649" NA     "D123" NA     NA     NA     NA 

Benchmarking on a million row data.frame.

df <- df[sample(rownames(df),1e6,replace=TRUE),]
system.time({hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
  seq_len(nrow(df)),
  replace(max.col(hits,"first"), rowSums(hits)==0, NA)
  )]})
#   user  system elapsed 
#  1.904   0.120   2.024 

system.time(apply(df, 1, function(x) grep("D", x, value=T)[1]))
#   user  system elapsed 
# 23.141   0.172  23.317

Upvotes: 4

Se&#241;or O
Se&#241;or O

Reputation: 17432

Since you're operating by rows, you want apply, not sapply

Additionally, using the value = TRUE argument inside of grep will return the actual string you're looking for

> apply(df, 1, function(x) grep("D", x, value=T)[1])
     1      2      3      4      5      6      7      8      9     10 
    NA "D127"     NA "D649"     NA "D123"     NA     NA     NA     NA 

Upvotes: 3

Related Questions