Reputation: 958
The columns in my data frame are ordered by relevance, the left column being the most relevant one. I am trying to extract the most relevant item that starts with 'D'.
Here is an example:
df <- structure(list(TDIAG1 = structure(c(7L, 2L, 6L, 8L, 4L, 1L, 5L,
5L, 9L, 3L), .Label = c("D123", "D127", "E611", "E1133", "H269",
"K701", "K704", "K922", "R0989"), class = "factor"), TDIAG2 = structure(c(7L,
6L, 5L, 2L, 3L, 6L, 4L, 4L, 1L, 1L), .Label = c("", "D649", "H431",
"H570", "K703", "D123", "R18"), class = "factor"), TDIAG3 = structure(c(2L,
6L, 5L, 4L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F102", "H333",
"K296", "K658", "Z720"), class = "factor"), TDIAG4 = structure(c(2L,
1L, 4L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "E834", "K703",
"K766"), class = "factor"), TDIAG5 = structure(c(1L, 1L, 3L,
2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F101", "F102"), class = "factor"),
TDIAG6 = structure(c(1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "E877", "Z720"), class = "factor")), .Names = c("TDIAG1",
"TDIAG2", "TDIAG3", "TDIAG4", "TDIAG5", "TDIAG6"), row.names = c(NA,
10L), class = "data.frame")
> df
TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
1 K704 R18 F102 E834
2 D127 D123 Z720
3 K701 K703 K658 K766 F102 E877
4 K922 D649 K296 K703 F101 Z720
5 E1133 H431 H333
6 D123 D123
7 H269 H570
8 H269 H570
9 R0989
10 E611
The resulting vector should report NA when there is no match, and the first (most left) item when there is a match. I can find the items I am interested in... however, I get stuck in extracting the first (most left) for each row.
> sapply(df, function (x) grepl("D", x))
TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] TRUE TRUE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] TRUE TRUE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE
The results should be:
c(NA,"D127", NA, "D649", NA, "D123", NA, NA, NA, NA)
EDIT: How would this be extended if i wanted the pattern to be c("D", "K")? I get an error saying that it only takes the first one. (Answer: change pattern to "D|K")
EDIT 2: Also, how do I extent this when i want to find the most left "D" code per row but exclude codes from a pre-specified list (eg. exclude c("D123", "D090", "D111")?
EDIT 3: I have written a small function that includes all answers. It works well for what I am doing. Perhaps it might benefit someone else at some stage.
Function:
FLAG <- function(data, tomatch, Exact.tomatch=T, Exclude=NA, Exact.excl=T, Return=c("01", "FirstValue", "Count")){
if(Exact.tomatch == T){tomatch <-paste("^",tomatch,"$", sep="")}
if(length(tomatch) > 1){tomatch <- paste(tomatch, collapse="|")}
if(Exact.excl==F){Exclude <- paste(Exclude, collapse="|")}
out <- NA
if(is.na(Exclude[1])==T){hits <- vapply(data, grepl, logical(nrow(data)), pattern = tomatch)}
if(is.na(Exclude[1])!=T & Exact.excl==T){hits <- vapply(lapply(data, function(x) replace(x,x %in% Exclude, NA)), grepl, logical(nrow(data)), pattern = tomatch)}
if(is.na(Exclude[1])!=T & Exact.excl==F){hits <- vapply(replace(data, vapply(data, grepl, logical(nrow(data)), pattern = Exclude)==T, NA), grepl, logical(nrow(data)), pattern = tomatch)}
if(Return[1] == "01"){out <- replace(rowSums(hits), rowSums(hits) >1, 1)}
if(Return[1] == "Count"){out <- rowSums(hits)}
if(Return[1] == "FirstValue"){out <- data[cbind(seq_len(nrow(data)),replace(max.col(hits,"first"), rowSums(hits)==0, NA))]}
out
}
The function needs data frame or list as input. Then a vector of what to look for, what to exclude and whether these should be an exact match or not. Finally, it can return the first (most left) match, count of all matches, or just a flag if any of the resulting matches were found.
Example 1. Look in df for any code starting with D or K (do not restrict to exact matching), but exclude K701, K703 and D127 (exact match on these), and return the first (most left) value:
FLAG(data=df, tomatch=c("D", "K"), Exact.tomatch=F, Exclude=c("K701", "K703","D127"), Exact.excl=T, Return="FirstValue")
Example 2. Look in df for any code starting with D or H (do not restrict to exact matching), but exclude any code that includes H3 (none exact matching), and return the first (most left) value:
FLAG(data=df, tomatch=c("D", "H"), Exact.tomatch=F, Exclude=c("H3"), Exact.excl=F, Return="FirstValue")
Upvotes: 2
Views: 817
Reputation: 93938
No need to run across each row necessarily. Running grepl
nrow(df)
times will be probably much slower than vapply
(or even it's slower cousin sapply
) running ncol(df)
times. E.g.:
hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
seq_len(nrow(df)),
replace(max.col(hits,"first"), rowSums(hits)==0, NA)
)]
#[1] NA "D127" NA "D649" NA "D123" NA NA NA NA
Benchmarking on a million row data.frame.
df <- df[sample(rownames(df),1e6,replace=TRUE),]
system.time({hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
seq_len(nrow(df)),
replace(max.col(hits,"first"), rowSums(hits)==0, NA)
)]})
# user system elapsed
# 1.904 0.120 2.024
system.time(apply(df, 1, function(x) grep("D", x, value=T)[1]))
# user system elapsed
# 23.141 0.172 23.317
Upvotes: 4
Reputation: 17432
Since you're operating by rows, you want apply
, not sapply
Additionally, using the value = TRUE
argument inside of grep
will return the actual string you're looking for
> apply(df, 1, function(x) grep("D", x, value=T)[1])
1 2 3 4 5 6 7 8 9 10
NA "D127" NA "D649" NA "D123" NA NA NA NA
Upvotes: 3