Danielle

Reputation: 25

Using grep to find "cancer" but exclude "previous cancer"

I just want to start off by saying I am pretty new to coding in general, so I might not be using the right terms, but I'll try my best. Please let me know if something doesn't make sense :)

Basically, I have a set of really badly entered data. There is a comorbidity column/object where a patient's whole list of comorbidities is entered as a character string (along with a whole bunch of other irrelevant data).

example of what the data looks like: "breast cancer previous alcohol excess ihd cks" "previous breast cancer delirium pvd pulmonary embolus" "af heart failure colon cancer"

I am trying to count the number of comorbidities a patient has. I have a list of what would count as a comorbidity and what wouldn't. My plan (which I don't think is the best) is to use grep to recognise names of comorbidities and create a new object for each comorbidity group.

For example, under heart failure comorbidity group, anything in the data that says "ihd", "heart failure" or "cardiac failure" would be grouped into heart failure:

heartfailure <- grep("^ihd|heart failure|cardiac failure",
     comorb, value=FALSE)

The output is the row numbers containing the specified comorbidity, which I then turn into character. I will do this for every comorbidity group and then count the number of times each row number comes up, which would be the total number of comorbidities for the patient (every row in the data represents a patient).

The issue arises with comorbidities preceded by "previous", which should not be counted as a comorbidity.

For example, "breast cancer" would be a comorbidity but "previous breast cancer" would not.

I have tried

grep("!previous breast cancer| breast cancer",
     comorb, value= FALSE)

but it returns anything with breast cancer in it even if it has a previous before breast cancer.

The other issue is that, as the data has been entered badly, a row could have a "previous" that is associated with another comorbidity and has nothing to do with breast cancer (e.g. previous alcohol excess). So I would be incorrectly ruling that row out if the condition for ruling out was only "previous" (i.e. the "previous" has to come right before "breast cancer" for me to rule out the row).

Is there a solution to this?

Many thanks

Upvotes: 2

Views: 129

Answers (1)

neilfws

Reputation: 33782

It is difficult to provide a complete solution, as we do not have access to either the complete dataset or the list of comorbidity terms. But perhaps we can provide some ideas that might help you to build a solution.

First, when dealing with text in columns, the tidytext package is very useful.

Second, I would suggest trying to work within one data frame. For that you will find the dplyr package useful: in particular the mutate and case_when functions.

Here's an example. Using your data:

df1 <- data.frame(patient_id = 1:3,
                  description = c("breast cancer previous alcohol excess ihd cks",
                                  "previous breast cancer delirium pvd pulmonary embolus",
                                  "af heart failure colon cancer"))
df1

  patient_id                                           description
1          1         breast cancer previous alcohol excess ihd cks
2          2 previous breast cancer delirium pvd pulmonary embolus
3          3                         af heart failure colon cancer

We can use tidytext::unnest_tokens to break the description into single words, storing the words in a new column alongside the original text.

Then we can use dplyr::lag to check whether a word is preceded by the word "previous", and flag the word if it is.
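
To make the lag step concrete, here is a minimal sketch of what dplyr::lag does to a short vector of words. The vector `words` and the name `prev_flag` are made up for illustration. The `default` argument shown here is optional, and is one way to avoid an NA for the first word.

```r
library(dplyr)

# Hypothetical token vector, in the order the words appear in the text
words <- c("previous", "breast", "cancer")

# lag() shifts the vector one position to the right, so each element
# lines up with the word that came before it. The first element has no
# predecessor and is NA unless a default is supplied.
dplyr::lag(words)

# With default = "", the comparison gives FALSE rather than NA for the
# first word.
prev_flag <- dplyr::lag(words, default = "") == "previous"
prev_flag
```

Here only the second word ("breast") is flagged, because it is the only one directly preceded by "previous".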

Next, we can use case_when to define the comorbidity. Here is where you could add as many rules as you like to achieve the desired result.

# install these first
library(dplyr)
library(tidytext)

comorbidities <- df1 %>% 
  tidytext::unnest_tokens(terms, description, drop = FALSE) %>% 
  # .by (dplyr >= 1.1.0) keeps lag() and lead() within each patient;
  # default = "" stops the first word of each patient from getting NA
  mutate(is_previous = ifelse(lag(terms, default = "") == "previous", 1, 0),
         comorb = case_when(
           terms == "ihd" ~ "heart failure",
           terms == "heart" & lead(terms) == "failure" ~ "heart failure",
           terms == "breast" & lead(terms) == "cancer" ~ "breast cancer",
           terms == "colon" & lead(terms) == "cancer" ~ "colon cancer",
           TRUE ~ NA_character_
         ),
         .by = patient_id)

Result:

   patient_id                                           description     terms is_previous        comorb
1           1         breast cancer previous alcohol excess ihd cks    breast           0 breast cancer
2           1         breast cancer previous alcohol excess ihd cks    cancer           0          <NA>
3           1         breast cancer previous alcohol excess ihd cks  previous           0          <NA>
4           1         breast cancer previous alcohol excess ihd cks   alcohol           1          <NA>
5           1         breast cancer previous alcohol excess ihd cks    excess           0          <NA>
6           1         breast cancer previous alcohol excess ihd cks       ihd           0 heart failure
7           1         breast cancer previous alcohol excess ihd cks       cks           0          <NA>
8           2 previous breast cancer delirium pvd pulmonary embolus  previous           0          <NA>
9           2 previous breast cancer delirium pvd pulmonary embolus    breast           1 breast cancer
10          2 previous breast cancer delirium pvd pulmonary embolus    cancer           0          <NA>
11          2 previous breast cancer delirium pvd pulmonary embolus  delirium           0          <NA>
12          2 previous breast cancer delirium pvd pulmonary embolus       pvd           0          <NA>
13          2 previous breast cancer delirium pvd pulmonary embolus pulmonary           0          <NA>
14          2 previous breast cancer delirium pvd pulmonary embolus   embolus           0          <NA>
15          3                         af heart failure colon cancer        af           0          <NA>
16          3                         af heart failure colon cancer     heart           0 heart failure
17          3                         af heart failure colon cancer   failure           0          <NA>
18          3                         af heart failure colon cancer     colon           0  colon cancer
19          3                         af heart failure colon cancer    cancer           0          <NA>

Then you might employ dplyr::filter to return only the rows you want. For example, to remove rows with no comorbidity and rows flagged as "previous", then count comorbidities per patient. Note that Patient 2 would not be returned in this case:

comorbidities %>% 
  filter(!is.na(comorb), 
         is_previous == 0) %>% 
  count(patient_id, name = "comorbidities")

  patient_id comorbidities
1          1             2
2          3             2
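
If you would rather stay close to your original grep plan, a negative lookbehind might also work. This is only a sketch under some assumptions: the `patterns` list is a made-up example rather than your real list of comorbidity terms, and it assumes "previous" is always separated from the condition by exactly one space. Lookbehinds need `perl = TRUE` in base R.

```r
# Example data from the question
comorb <- c("breast cancer previous alcohol excess ihd cks",
            "previous breast cancer delirium pvd pulmonary embolus",
            "af heart failure colon cancer")

# One pattern per comorbidity group (hypothetical list; extend as needed).
# (?<!previous ) is a negative lookbehind: it matches only when the text
# is not immediately preceded by "previous ".
patterns <- c(
  heart_failure = "\\b(ihd|heart failure|cardiac failure)\\b",
  breast_cancer = "(?<!previous )\\bbreast cancer\\b",
  colon_cancer  = "(?<!previous )\\bcolon cancer\\b"
)

# Logical matrix: one row per patient, one column per comorbidity group
has_comorb <- sapply(patterns, function(p) grepl(p, comorb, perl = TRUE))

# Number of comorbidity groups found for each patient
n_comorb <- rowSums(has_comorb)
n_comorb
```

Because each group is a single TRUE/FALSE column, a patient with both "ihd" and "heart failure" in the text is counted once for that group rather than twice. A "previous" elsewhere in the row (such as "previous alcohol excess") does not affect the breast cancer match.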

Upvotes: 2
