Extracting number after a specific phrase

Question

I've been trying to write two regular expressions to do the following two tasks:

Pull the numbers after the phrase "EDG ICD HCUP CCS"
Pull the words after "EDG ICD HCUP CCS 159 (PREDICTIVE MODELS-VERSION 1.0)-"

I'd like to have the numbers stored in a column named "category" and the words stored in a column named "diagnosis".

The strings are located in the column named "GROUPER_NAME".

df <- structure(list(GROUPER_ID = structure(c("9001742130", "9001742138", 
"9001742058", "9001742062", "9001742102", "9001742247", "9001742055", 
"9001742158", "9001742036", "9001742053"), label = "GROUPER_ID", format.sas = "$"), 
    GROUPER_NAME = structure(c("EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE", 
    "EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS", 
    "EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS", 
    "EDG ICD HCUP CCS 62 (PREDICTIVE MODELS-VERSION 1.0)-COAGULATION AND HEMORRHAGIC DISORDERS", 
    "EDG ICD HCUP CCS 102 (PREDICTIVE MODELS-VERSION 1.0)-NONSPECIFIC CHEST PAIN", 
    "EDG ICD HCUP CCS 247 (PREDICTIVE MODELS-VERSION 1.0)-LYMPHADENITIS", 
    "EDG ICD HCUP CCS 55 (PREDICTIVE MODELS-VERSION 1.0)-FLUID AND ELECTROLYTE DISORDERS", 
    "EDG ICD HCUP CCS 158 (PREDICTIVE MODELS-VERSION 1.0)-CHRONIC KIDNEY DISEASE", 
    "EDG ICD HCUP CCS 36 (PREDICTIVE MODELS-VERSION 1.0)-CANCER OF THYROID", 
    "EDG ICD HCUP CCS 53 (PREDICTIVE MODELS-VERSION 1.0)-DISORDERS OF LIPID METABOLISM"
    ), label = "GROUPER_NAME", format.sas = "$")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

For the first example, I'd like to pull "159" and "URINARY TRACT INFECTIONS" and put them in columns "category" and "diagnosis," respectively. I've tried altering some of the solutions on here to fit my scenario, but I'm really awful with regular expressions and cannot get anything to work. Any help would be greatly appreciated!

akrun · Accepted Answer

We can use sub from base R. Capture the digits (\\d+) that follow the four-word prefix, and the characters that follow the ")-". In the replacement, refer to the captured groups by their backreferences (\\1, \\2), joined by ":", and read the result into a two-column data.frame with read.csv, using ":" as the separator:

# Backslashes must be doubled inside R string literals
read.csv(text = sub("\\w+ \\w+ \\w+ \\w+ (\\d+)\\s.*\\)-(.*)", 
         "\\1:\\2", df$GROUPER_NAME), sep = ":", header = FALSE, 
      col.names = c("category", "diagnosis"))

-output

   category                                             diagnosis
1       130            PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
2       138                                  ESOPHAGEAL DISORDERS
3        58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
4        62                 COAGULATION AND HEMORRHAGIC DISORDERS
5       102                                NONSPECIFIC CHEST PAIN
6       247                                         LYMPHADENITIS
7        55                       FLUID AND ELECTROLYTE DISORDERS
8       158                                CHRONIC KIDNEY DISEASE
9        36                                     CANCER OF THYROID
10       53                         DISORDERS OF LIPID METABOLISM
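
As an alternative sketch (not part of the answer above), base R's strcapture can return the two capture groups directly as typed columns. This skips the intermediate ":"-separated string, which could misparse if a diagnosis ever contained a colon. The pattern below assumes every string starts with the literal prefix "EDG ICD HCUP CCS"; the names x, pattern, parsed, and out are made up for illustration.

```r
# Two sample strings copied from GROUPER_NAME in the question
x <- c(
  "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
  "EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS"
)

# Group 1: digits after the literal prefix; group 2: everything after ")-"
pattern <- "^EDG ICD HCUP CCS (\\d+) \\(.*\\)-(.*)$"

# proto supplies the names and types of the output columns
parsed <- strcapture(
  pattern, x,
  proto = data.frame(category = integer(), diagnosis = character(),
                     stringsAsFactors = FALSE)
)

# Bind the new columns next to the original strings
out <- cbind(data.frame(GROUPER_NAME = x, stringsAsFactors = FALSE), parsed)
```

With the data from the question, the same idea should work by passing df$GROUPER_NAME as x and calling cbind(df, parsed). Strings that do not match the pattern come back as NA in both columns.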
