user122514

Reputation: 407

Extracting number after a specific phrase

I've been trying to write two regular expressions to do the following two tasks:

  1. Pull the numbers after the phrase "EDG ICD HCUP CCS"
  2. Pull the words after "EDG ICD HCUP CCS 159 (PREDICTIVE MODELS-VERSION 1.0)-"

I'd like to have the numbers stored in a column named "category" and the words stored in a column named "diagnosis".

The strings are located in the column named "GROUPER_NAME".

df <- structure(list(GROUPER_ID = structure(c("9001742130", "9001742138", 
"9001742058", "9001742062", "9001742102", "9001742247", "9001742055", 
"9001742158", "9001742036", "9001742053"), label = "GROUPER_ID", format.sas = "$"), 
    GROUPER_NAME = structure(c("EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE", 
    "EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS", 
    "EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS", 
    "EDG ICD HCUP CCS 62 (PREDICTIVE MODELS-VERSION 1.0)-COAGULATION AND HEMORRHAGIC DISORDERS", 
    "EDG ICD HCUP CCS 102 (PREDICTIVE MODELS-VERSION 1.0)-NONSPECIFIC CHEST PAIN", 
    "EDG ICD HCUP CCS 247 (PREDICTIVE MODELS-VERSION 1.0)-LYMPHADENITIS", 
    "EDG ICD HCUP CCS 55 (PREDICTIVE MODELS-VERSION 1.0)-FLUID AND ELECTROLYTE DISORDERS", 
    "EDG ICD HCUP CCS 158 (PREDICTIVE MODELS-VERSION 1.0)-CHRONIC KIDNEY DISEASE", 
    "EDG ICD HCUP CCS 36 (PREDICTIVE MODELS-VERSION 1.0)-CANCER OF THYROID", 
    "EDG ICD HCUP CCS 53 (PREDICTIVE MODELS-VERSION 1.0)-DISORDERS OF LIPID METABOLISM"
    ), label = "GROUPER_NAME", format.sas = "$")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

For example, given a string such as "EDG ICD HCUP CCS 159 (PREDICTIVE MODELS-VERSION 1.0)-URINARY TRACT INFECTIONS", I'd like to pull "159" and "URINARY TRACT INFECTIONS" and put them in the columns "category" and "diagnosis", respectively. I've tried to adapt some of the solutions on here to fit my scenario, but I'm really awful with regular expressions and cannot get anything to work. Any help would be greatly appreciated!

Upvotes: 3

Views: 83

Answers (2)

TarJae

Reputation: 79311

Updated: this now also covers the second part, which I missed at first.

You could use parse_number from readr to extract the number (it returns the first number found in each string) and sub to keep only the part after the last "-".

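library(dplyr)
library(readr)

df %>%
  # parse_number() returns the first number found in each string
  mutate(category = parse_number(GROUPER_NAME), .before = GROUPER_NAME) %>%
  # drop everything up to the last "-"; .keep = "unused" removes GROUPER_NAME
  mutate(diagnosis = sub(".*-", "", GROUPER_NAME), .keep = "unused")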

Output:

   GROUPER_ID category diagnosis                                            
   <chr>         <dbl> <chr>                                                
 1 9001742130      130 PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE           
 2 9001742138      138 ESOPHAGEAL DISORDERS                                 
 3 9001742058       58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
 4 9001742062       62 COAGULATION AND HEMORRHAGIC DISORDERS                
 5 9001742102      102 NONSPECIFIC CHEST PAIN                               
 6 9001742247      247 LYMPHADENITIS                                        
 7 9001742055       55 FLUID AND ELECTROLYTE DISORDERS                      
 8 9001742158      158 CHRONIC KIDNEY DISEASE                               
 9 9001742036       36 CANCER OF THYROID                                    
10 9001742053       53 DISORDERS OF LIPID METABOLISM   
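One thing to keep in mind is that ".*-" is greedy, so it removes everything up to the last hyphen in the string. That is fine for the sample data, but a diagnosis that itself contains a hyphen would be cut short. The sketch below uses base R only and compares the greedy pattern with one anchored on the first ")-". The second input string is a hypothetical value, not taken from the question's data, and the variable names are illustrative.

```r
# The first string is from the question; the second is a HYPOTHETICAL value
# (not in the original data) whose diagnosis contains a hyphen.
x <- c(
  "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
  "EDG ICD HCUP CCS 999 (PREDICTIVE MODELS-VERSION 1.0)-NON-SPECIFIC EXAMPLE"
)

# Greedy ".*-" consumes everything up to the LAST hyphen in each string.
greedy <- sub(".*-", "", x)

# "^[^)]*\\)-" consumes everything up to and including the FIRST ")-",
# so hyphens inside the diagnosis are preserved.
anchored <- sub("^[^)]*\\)-", "", x)

print(greedy)
print(anchored)
```

If the real data never has a hyphen in the diagnosis, both patterns give the same result and the shorter one is perfectly adequate.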

Upvotes: 3

akrun

Reputation: 887971

We can use sub from base R. Capture the digits (\\d+) after the four-word prefix, and the characters after the ")-". In the replacement, refer to the captured groups with backreferences (\\1, \\2) separated by ":", then read the result into a two-column data.frame with read.csv.

read.csv(text = sub("\\w+ \\w+ \\w+ \\w+ (\\d+)\\s.*\\)-(.*)",
                    "\\1:\\2", df$GROUPER_NAME),
         sep = ":", header = FALSE,
         col.names = c("category", "diagnosis"))

-output

   category                                             diagnosis
1       130            PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
2       138                                  ESOPHAGEAL DISORDERS
3        58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
4        62                 COAGULATION AND HEMORRHAGIC DISORDERS
5       102                                NONSPECIFIC CHEST PAIN
6       247                                         LYMPHADENITIS
7        55                       FLUID AND ELECTROLYTE DISORDERS
8       158                                CHRONIC KIDNEY DISEASE
9        36                                     CANCER OF THYROID
10       53                         DISORDERS OF LIPID METABOLISM
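Because the approach above round-trips through read.csv with ":" as the separator, a diagnosis containing a colon would be split into extra fields. As a possible alternative, here is a minimal sketch using utils::strcapture (available in base R since 3.4.0), which maps capture groups straight onto typed columns. The pattern spells out the literal prefix rather than using \\w+, and the names x, proto and parsed are illustrative assumptions rather than part of the original answer.

```r
# Two strings taken from the question's sample data.
x <- c(
  "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
  "EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS"
)

# Prototype: one column per capture group, with the desired types.
proto <- data.frame(category = integer(), diagnosis = character(),
                    stringsAsFactors = FALSE)

# Group 1: digits after the literal prefix.
# Group 2: everything after the parenthesised version text and the hyphen.
parsed <- utils::strcapture(
  "^EDG ICD HCUP CCS (\\d+) \\([^)]*\\)-(.*)$",
  x, proto
)

print(parsed)
```

The result can be attached to the original data with cbind(df, parsed) once x is replaced by df$GROUPER_NAME. Strings that do not match the pattern come back as NA in both columns, which makes malformed rows easy to spot.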

Upvotes: 5
