A Geneslaw
A Geneslaw

Reputation: 21

R: create single new column based upon matching string in multiple other columns

I have a dataset with a very large number of string column variables representing procedural codes. There is another column of variables that represents the coding format (some are ICD9, some are other more arcane formats). Each observation is a patient. I need to:

  1. Search through each variable name with a particular prefix
  2. Ensure that the code being used is an ICD9 code (represented by "02").
  3. Find which of these codes match the first 3 characters of a particular string
  4. Create a new column variable =1 if any of the variables begin with those three characters, and =0 if none matched

There are too many variables for it to be trivial to create a list of each of them manually via c("cd1", "cd2", ...), and I potentially need to do this many different times so I'd like to make it as general a solution as possible.

Here's a simplified sample, where the strings I need to search for begin with "231":

ID   cd1    type1  cd2    type2  cd3    type3  cd4    type4
1    "231"  "02"   "219"  "02"   "1321" "02"  "2314"  "02"
2    "222"  "02"  
3    "123"  "142"   
4    "145"  "02"   "521"  "02"   "2313" "02"   
5    "231"  "01"

The output I'd like is:

ID   cd1    type1  cd2    type2  cd3    type3  cd4    type4   flag_var
1    "231"  "02"   "219"  "02"   "1321" "02"  "2314"  "02"    1
2    "222"  "02"                                              0
3    "123"  "142"                                             0
4    "145"  "02"   "521"  "02"   "2313" "02"                  1
5    "231"  "01"                                              0

(ID #5 is set to 0 because, even though the cd1 code is "231", the type1 variable is "01" and thus not in the correct coding format")

I've been somewhat successful in accomplishing this using mutate and case_when:

df <- df %>%
  mutate(flag_var = case_when(
    startsWith(cd1, "231") ~ 1, 
    startsWith(cd2, "231") ~ 1, 
    startsWith(cd3, "231") ~ 1, 
    startsWith(cd4, "231") ~ 1, 
    TRUE ~ 0))

Like I said, the actual dataset has too many variables and potential searches for it to make sense to hard-code it in the manner above. I think there should be a good way to do this using mutate_at or another dplyr function, but I haven't been able to figure it out.

I've been able to use the following code to get a set of new variables equal to 1 or 0, but not a single variable. I can then use rowSums to sum across all columns and check if this value is nonzero. But it's ugly and laborious:

df <- df %>% mutate_at(vars(starts_with("cd")),
                       funs(testvar = ifelse(startsWith(., "231"), 1, 0)))
test_names = df %>% select(ends_with("_testvar")) %>% names()
df <- df %>% mutate(flag_var = (rowSums(.[test_names]) == 1))
df <- df %>% select(-ends_with("_testvar"))

Anyone have any simpler ideas? Thanks so much!

EDIT: I realized that I also have to incorporate coding type variables. The initial sample data tables have been edited to reflect this.

Upvotes: 2

Views: 2320

Answers (3)

mt1022
mt1022

Reputation: 17289

Here is another solution:

df$flag_var <- 1*(rowSums(substring(as.matrix(df[, 2:ncol(df)]), 1, 3) == '231') > 0)

#   ID cd1 cd2  cd3  cd4 flag_var
# 1  1 231 219 1321 2314        1
# 2  2 222                      0
# 3  3 123 142                  0
# 4  4 145 521 2313  122        1

For the updated example, since the cd columns and type columns are paired, the following code should work:

cd.cols <- grepl('^cd', colnames(df))
type.cols <- grepl('^type', colnames(df))

flag <- substring(as.matrix(df[,cd.cols]), 1, 3) == '231' & df[,type.cols] == '02'
df$flag_var <- 1 * (rowSums(flag) > 0)

# > df
#   ID cd1 type1 cd2 type2  cd3 type3  cd4 type4 flag_var
# 1  1 231    02 219    02 1321    02 2314    02        1
# 2  2 222    02                                        0
# 3  3 123   142                                        0
# 4  4 145    02 521    02 2313    02                   1
# 5  5 231    01                                        0

Upvotes: 1

akrun
akrun

Reputation: 886938

We could loop through the columns and do a partial match with grepl, Reduce the list of vectors to a single logical vector and coerce the values to binary

df$flag_var <- +(Reduce(`|`, lapply(df[-1], grepl, pattern = '^231')))

data

df <- structure(list(ID = c("1", "2", "3", "4"), cd1 = c("231", "222", 
"123", "145"), cd2 = c("219", "", "142", "521"), cd3 = c("1321", 
 "", "", "2313"), cd4 = c("2314", "", "", "122")), row.names = c(NA, 
 -4L), class = "data.frame")

Upvotes: 1

dylanjm
dylanjm

Reputation: 2101

This might answer question or do you need the 0-1 as row values?

library(tidyverse)

dat <- tribble(~ID,   ~cd1,      ~cd2,      ~cd3,      ~cd4,
        1,    "231",    "219",    "1321",   "2314",
        2,    "222",    ""      , ""    ,   "",
        3,    "123",    "142",    ""    ,   "",
        4,    "145",    "521",    "2313",   "122")

dat %>% 
  gather("cd_type", "code", 2:5) %>% 
  mutate(flag_var = case_when(
    startsWith(code, "231") ~ 1,
    TRUE ~ 0
  ))
#> # A tibble: 16 x 4
#>       ID cd_type code  flag_var
#>    <dbl> <chr>   <chr>    <dbl>
#>  1     1 cd1     231          1
#>  2     2 cd1     222          0
#>  3     3 cd1     123          0
#>  4     4 cd1     145          0
#>  5     1 cd2     219          0
#>  6     2 cd2     ""           0
#>  7     3 cd2     142          0
#>  8     4 cd2     521          0
#>  9     1 cd3     1321         0
#> 10     2 cd3     ""           0
#> 11     3 cd3     ""           0
#> 12     4 cd3     2313         1
#> 13     1 cd4     2314         1
#> 14     2 cd4     ""           0
#> 15     3 cd4     ""           0
#> 16     4 cd4     122          0

Or do this to get back into original wide format

dat %>% 
  gather("cd_type", "code", 2:5) %>% 
  mutate(flag_var = case_when(
    startsWith(code, "231") ~ 1,
    TRUE ~ 0
  )) %>% 
  spread(cd_type, code) %>% 
  select(ID, cd1:cd4, flag_var)
#> # A tibble: 6 x 6
#>      ID cd1   cd2   cd3   cd4   flag_var
#>   <dbl> <chr> <chr> <chr> <chr>    <dbl>
#> 1     1 <NA>  219   1321  <NA>         0
#> 2     1 231   <NA>  <NA>  2314         1
#> 3     2 222   ""    ""    ""           0
#> 4     3 123   142   ""    ""           0
#> 5     4 145   521   <NA>  122          0
#> 6     4 <NA>  <NA>  2313  <NA>         1

Created on 2019-01-19 by the reprex package (v0.2.1)

Upvotes: 1

Related Questions