tfmunkey

Reputation: 69

R - Count the number of times similar strings appear in several columns

Following from an earlier problem: R - return boolean if any strings in a vector appear in any of several columns

I didn't think I needed to count the number of similar strings from my vector that appear in my data frame, but it turns out it's useful information. D'oh!

The problem: I have a large data frame in which columns 5 to 24 are diagnosis codes. Each row is an individual hospital admission. The vector risk_codes contains truncated diagnosis codes. I wanted to add a new column to the data frame that tells me whether any of the risk_codes appear in the 20 diagnosis codes. The catch was that I needed a partial match, not a full match.

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20
data   data   data   data   J123    F456    H789       E468
data   data   data   data   T452    NA      NA         NA

The code to do that:

df$newcol <- apply(df, 1, function(x) any(sapply(risk_codes, function(codes) grepl(codes, x[5:24]))))
df$newcol <- ifelse(df$newcol, 1, 0)

This successfully returns 1 to the new column if any risk_codes match the admission's diagnosis codes.

risk_codes <- c("J1","F45","H987")

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20   newcol
data   data   data   data   J123    F456    H789       E468      1
data   data   data   data   T452    NA      NA         NA        0
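For anyone who wants to reproduce this, here is a minimal sketch on toy data. The data frame is made up to mirror the rows above (only three diagnosis columns), and `diag_cols` is a helper name introduced here for illustration. Note that `grepl()` matches the pattern anywhere in the string, so "J1" would also hit a code like "AJ12"; if the risk codes are always prefixes, anchoring with `paste0("^", code)` may be safer.

```r
# Toy stand-in for the real admissions table (hypothetical values).
df <- data.frame(
  Col1   = c("data", "data"),
  Diag_1 = c("J123", "T452"),
  Diag_2 = c("F456", NA),
  Diag_3 = c("H789", NA),
  stringsAsFactors = FALSE
)
risk_codes <- c("J1", "F45", "H987")

# Positions of the diagnosis columns (5:24 in the real data).
diag_cols <- grep("^Diag_", names(df))

# grepl() does a partial (regex) match; NA diagnoses simply give FALSE.
df$newcol <- apply(df[diag_cols], 1, function(x)
  any(sapply(risk_codes, function(code) grepl(code, x))))
df$newcol <- as.integer(df$newcol)

print(df)
```

The first row is flagged because J123 and F456 match; the second row has no match.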

The additional complication: Now I'd like to count the number of matches, rather than just see that there are matches. It's likely a manipulation of the first line of code presented but I'm struggling to find the logic.

risk_codes <- c("J1","F45","H987")

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20  newcol  count
data   data   data   data   J123    F456    H789       E468     1       2
data   data   data   data   T452    NA      NA         NA       0       0

Upvotes: 0

Views: 1182

Answers (1)

IRTFM

Reputation: 263352

Assuming you are referring to columns rather than rows, replacing any() with sum() should work, since summing the logical matches counts the TRUE values:

df$code_count <- apply(df,1,function(x) 
                         sum(sapply(risk_codes, function(codes) grepl(codes,x[c(5:24)]))))
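A small worked sketch of what that count means, on made-up data (the third row and the names `diag_cols` and `codes_hit` are hypothetical, added only for illustration). `sum()` counts every (risk code, diagnosis column) hit, so a row with two diagnoses under the same risk code counts twice. If the number of distinct risk codes matched is wanted instead, wrapping the inner `grepl()` in `any()` is one option. The patterns here are anchored with "^" on the assumption that the risk codes are prefixes.

```r
# Toy stand-in for the real admissions table (hypothetical values).
df <- data.frame(
  Col1   = c("data", "data", "data"),
  Diag_1 = c("J123", "T452", "J123"),
  Diag_2 = c("F456", NA,     "J188"),
  Diag_3 = c("H789", NA,     NA),
  stringsAsFactors = FALSE
)
risk_codes <- c("J1", "F45", "H987")
diag_cols  <- grep("^Diag_", names(df))  # 5:24 in the real data

# Total hits: every diagnosis column that starts with any risk code.
df$code_count <- apply(df[diag_cols], 1, function(x)
  sum(sapply(risk_codes, function(code) grepl(paste0("^", code), x))))

# Distinct risk codes that appear at least once in the row.
df$codes_hit <- apply(df[diag_cols], 1, function(x)
  sum(sapply(risk_codes, function(code) any(grepl(paste0("^", code), x)))))

print(df)
```

In the third row both J123 and J188 match "J1", so code_count is 2 while codes_hit is 1.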

Upvotes: 2
