Reputation: 3
I have a very large dataset (189k) with >250 variables where the variables have been inserted using web-based tick boxes. Some variables have then been combined into single lines such as medical comorbidities:
Chronic Pulmonary Disease|Congestive Cardiac Failure|Hypertension|Diabetes|Obesity (BMI>35)|Ischaemic Heart Disease
This variable has ~1500 combinations of medical conditions like the above line. I want to mutate into separate columns for example Col 1: Hypertension (Yes/No), Col 2: Diabetes (Yes/No) ... so that the presence or absence of a pre-existing condition can be used as predictive variables.
Is there a way to code this in R?
Upvotes: 0
Views: 62
Reputation: 4169
It looks to me like you have data where there is a variable of all medical conditions, separated by |
— something like:
Patient Comorbids
1: D88310 Diabetes|Obesity (BMI>35)
2: B9939 <NA>
3: J3923 Hypertension|Obesity (BMI>35)
4: H09203 Hypertension|Diabetes|Chronic Pulmonary Disease
Using the tstrsplit()
function from the data.table package to split that up and grepl()
you can score the presence of each disease in each patient as:
# Remove braces (pay attention to these sorts of issues)
data1[, Comorbids := gsub("\\(|\\)", "", Comorbids)]
# Split the strings into individual values - unique used to find all unique values
conditions <- unique(unlist(tstrsplit(data1[, Comorbids], "\\|")))
conditions <- conditions[!is.na(conditions)]
# Score the occurence and add on to data
data2 <- data.table(data1[, -c("Comorbids")],
sapply(conditions, grepl, data1[, Comorbids]))
Giving:
Patient Diabetes Hypertension Obesity BMI>35 Chronic Pulmonary Disease
1: D88310 TRUE FALSE TRUE FALSE
2: B9939 FALSE FALSE FALSE FALSE
3: J3923 FALSE TRUE TRUE FALSE
4: H09203 TRUE TRUE FALSE TRUE
Upvotes: 1