Reputation: 25
I just want to start off by saying I am pretty new to coding in general, so I might not be using the right terms, but I'll try my best. Please let me know if something doesn't make sense :)
Basically, I have a set of really badly entered data. There is a comorbidity column/object where a patient's whole list of comorbidities is entered as a single character string (along with a whole bunch of other irrelevant data).
Example of what the data looks like: "breast cancer previous alcohol excess ihd cks" "previous breast cancer delirium pvd pulmonary embolus" "af heart failure colon cancer"
I am trying to count the number of comorbidities a patient has. I have a list of what would count as a comorbidity and what wouldn't. My plan (which I don't think is the best) is to use grep to recognise names of comorbidities and create a new object for each comorbidity group.
For example, under heart failure comorbidity group, anything in the data that says "ihd", "heart failure" or "cardiac failure" would be grouped into heart failure:
heartfailure <- grep("^ihd|heart failure|cardiac failure",
comorb, value=FALSE)
The output is the row numbers that contain the specified comorbidity, which I then convert to character. I will do this for every comorbidity group and then count how many times each row number comes up, which would be the total number of comorbidities for that patient (every row in the data represents a patient).
The issue arises with comorbidities that are preceded by "previous", which should not be counted as a comorbidity.
For example, "breast cancer" would be a comorbidity but "previous breast cancer" would not.
I have tried
grep("!previous breast cancer| breast cancer",
comorb, value= FALSE)
but it returns every row with breast cancer in it, even when "previous" comes right before breast cancer.
The other issue is that, because the data has been entered badly, a row could contain a "previous" that belongs to a different condition and has nothing to do with breast cancer (e.g. previous alcohol excess). If the rule for excluding a row were just "contains previous", I would be incorrectly ruling that row out (i.e. the "previous" has to come right before breast cancer for me to rule out the row).
Is there a solution to this?
Many thanks
Upvotes: 2
Views: 129
Reputation: 33782
It is difficult to provide a complete solution, as we do not have access to either the complete dataset or the list of comorbidity terms. But perhaps we can provide some ideas that might help you to build a solution.
First, when dealing with text in columns, the tidytext package is very useful.
Second, I would suggest trying to work within one data frame. For that you will find the dplyr package useful: in particular the mutate and case_when functions.
Here's an example. Using your data:
df1 <- data.frame(patient_id = 1:3,
                  description = c("breast cancer previous alcohol excess ihd cks",
                                  "previous breast cancer delirium pvd pulmonary embolus",
                                  "af heart failure colon cancer"))
df1
patient_id description
1 1 breast cancer previous alcohol excess ihd cks
2 2 previous breast cancer delirium pvd pulmonary embolus
3 3 af heart failure colon cancer
We can use tidytext::unnest_tokens to break the description into single words, storing the words in a new column alongside the original text.
Then we can use dplyr::lag to check whether a word is preceded by the word "previous", and flag the word if it is.
Next, we can use case_when to define the comorbidity. Here is where you could add as many rules as you like to achieve the desired result.
# install these packages first, e.g. install.packages(c("dplyr", "tidytext"))
library(dplyr)
library(tidytext)
comorbidities <- df1 %>%
  # one row per word, keeping the original description column
  tidytext::unnest_tokens(terms, description, drop = FALSE) %>%
  # flag words directly preceded by "previous", then map words to comorbidity groups
  mutate(is_previous = ifelse(lag(terms) == "previous", 1, 0),
         comorb = case_when(
           terms == "ihd" ~ "heart failure",
           terms == "heart" & lead(terms) == "failure" ~ "heart failure",
           terms == "breast" & lead(terms) == "cancer" ~ "breast cancer",
           terms == "colon" & lead(terms) == "cancer" ~ "colon cancer",
           TRUE ~ NA_character_
         ))
Result:
patient_id description terms is_previous comorb
1 1 breast cancer previous alcohol excess ihd cks breast NA breast cancer
2 1 breast cancer previous alcohol excess ihd cks cancer 0 <NA>
3 1 breast cancer previous alcohol excess ihd cks previous 0 <NA>
4 1 breast cancer previous alcohol excess ihd cks alcohol 1 <NA>
5 1 breast cancer previous alcohol excess ihd cks excess 0 <NA>
6 1 breast cancer previous alcohol excess ihd cks ihd 0 heart failure
7 1 breast cancer previous alcohol excess ihd cks cks 0 <NA>
8 2 previous breast cancer delirium pvd pulmonary embolus previous 0 <NA>
9 2 previous breast cancer delirium pvd pulmonary embolus breast 1 breast cancer
10 2 previous breast cancer delirium pvd pulmonary embolus cancer 0 <NA>
11 2 previous breast cancer delirium pvd pulmonary embolus delirium 0 <NA>
12 2 previous breast cancer delirium pvd pulmonary embolus pvd 0 <NA>
13 2 previous breast cancer delirium pvd pulmonary embolus pulmonary 0 <NA>
14 2 previous breast cancer delirium pvd pulmonary embolus embolus 0 <NA>
15 3 af heart failure colon cancer af 0 <NA>
16 3 af heart failure colon cancer heart 0 heart failure
17 3 af heart failure colon cancer failure 0 <NA>
18 3 af heart failure colon cancer colon 0 colon cancer
19 3 af heart failure colon cancer cancer 0 <NA>
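One thing to watch in this table: the very first row has is_previous = NA, because lag() has nothing to look back at, and a filter on is_previous == 0 will silently drop that row (here, patient 1's "breast cancer"). lag() and lead() also run across patient boundaries, since the data is not grouped. A minimal sketch of one way around both, assuming that grouping by patient_id and giving lag()/lead() an empty-string default suits your data; the names comorbidities_grouped and next_term are introduced here for illustration only:

```r
library(dplyr)
library(tidytext)

df1 <- data.frame(patient_id = 1:3,
                  description = c("breast cancer previous alcohol excess ihd cks",
                                  "previous breast cancer delirium pvd pulmonary embolus",
                                  "af heart failure colon cancer"))

comorbidities_grouped <- df1 %>%
  unnest_tokens(terms, description, drop = FALSE) %>%
  # group so lag()/lead() never look into a neighbouring patient's words
  group_by(patient_id) %>%
  mutate(
    # default = "" means a patient's first word gets 0 rather than NA
    is_previous = as.integer(lag(terms, default = "") == "previous"),
    next_term = lead(terms, default = ""),
    comorb = case_when(
      terms == "ihd" ~ "heart failure",
      terms == "heart" & next_term == "failure" ~ "heart failure",
      terms == "breast" & next_term == "cancer" ~ "breast cancer",
      terms == "colon" & next_term == "cancer" ~ "colon cancer",
      TRUE ~ NA_character_
    )
  ) %>%
  ungroup()

# same filter-and-count step as in the answer, applied to the grouped version
counts_grouped <- comorbidities_grouped %>%
  filter(!is.na(comorb), is_previous == 0) %>%
  count(patient_id, name = "comorbidities")
```

With this variant patient 1 is counted for both breast cancer and heart failure. If the same group can be triggered by several words in one description, counting n_distinct(comorb) per patient instead of rows would avoid double counting.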
Then you might employ dplyr::filter to return only the rows you want. For example, you could remove rows with no comorbidity and rows flagged as "previous", then count the remaining comorbidities per patient. Note that Patient 2 would not be returned in this case, because their only matched comorbidity is flagged as "previous":
comorbidities %>%
  filter(!is.na(comorb),
         is_previous == 0) %>%
  count(patient_id, name = "comorbidities")
patient_id comorbidities
1 1 1
2 3 2
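If you would rather stay close to the grep approach from the question, note that "!" is just a literal character in a regular expression, not a negation. With perl = TRUE, a negative lookbehind can express "breast cancer not directly preceded by previous". A rough sketch in base R, assuming a single space always separates "previous" from the condition; the patterns vector is illustrative and not a complete list of comorbidities:

```r
comorb <- c("breast cancer previous alcohol excess ihd cks",
            "previous breast cancer delirium pvd pulmonary embolus",
            "af heart failure colon cancer")

# one pattern per comorbidity group; (?<!previous ) rejects a match
# that comes immediately after the word "previous"
patterns <- c(
  heart_failure = "(?<!previous )\\b(ihd|heart failure|cardiac failure)\\b",
  breast_cancer = "(?<!previous )\\bbreast cancer\\b",
  colon_cancer  = "(?<!previous )\\bcolon cancer\\b"
)

# logical matrix: one row per patient, one column per comorbidity group
hits <- sapply(patterns, function(p) grepl(p, comorb, perl = TRUE))

# number of comorbidity groups matched for each patient
n_comorbidities <- rowSums(hits)
```

Because each group is a single TRUE/FALSE column, a patient with both "ihd" and "heart failure" in their text is still counted once for that group. A "previous" attached to something else in the row (such as "previous alcohol excess") does not affect the breast cancer match.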
Upvotes: 2