James Knight
James Knight

Reputation: 3

R: Text separation into new columns ie mutate in R

I have a very large dataset (189k) with >250 variables where the variables have been inserted using web-based tick boxes. Some variables have then been combined into single lines such as medical comorbidities:

Chronic Pulmonary Disease|Congestive Cardiac Failure|Hypertension|Diabetes|Obesity (BMI>35)|Ischaemic Heart Disease

This variable has ~1500 combinations of medical conditions like the above line. I want to mutate into separate columns for example Col 1: Hypertension (Yes/No), Col 2: Diabetes (Yes/No) ... so that the presence or absence of a pre-existing condition can be used as predictive variables.

Is there a way to code this in R?

Upvotes: 0

Views: 62

Answers (1)

rg255
rg255

Reputation: 4169

It looks to me like you have data where there is a variable of all medical conditions, separated by | — something like:

   Patient                                       Comorbids
1:  D88310                       Diabetes|Obesity (BMI>35)
2:   B9939                                            <NA>
3:   J3923                   Hypertension|Obesity (BMI>35)
4:  H09203 Hypertension|Diabetes|Chronic Pulmonary Disease

Using the tstrsplit() function from the data.table package to split that up and grepl() you can score the presence of each disease in each patient as:

# Remove braces (pay attention to these sorts of issues)
data1[, Comorbids := gsub("\\(|\\)", "", Comorbids)]

# Split the strings into individual values - unique used to find all unique values
conditions <- unique(unlist(tstrsplit(data1[, Comorbids], "\\|")))
conditions <- conditions[!is.na(conditions)]

# Score the occurence and add on to data
data2 <- data.table(data1[, -c("Comorbids")], 
                    sapply(conditions, grepl, data1[, Comorbids]))

Giving:

   Patient Diabetes Hypertension Obesity BMI>35 Chronic Pulmonary Disease
1:  D88310     TRUE        FALSE           TRUE                     FALSE
2:   B9939    FALSE        FALSE          FALSE                     FALSE
3:   J3923    FALSE         TRUE           TRUE                     FALSE
4:  H09203     TRUE         TRUE          FALSE                      TRUE

Upvotes: 1

Related Questions