Reputation: 407
I've been trying to write two regular expressions to do the following two tasks:
I'd like to have the numbers stored in a column named "category" and the words stored in a column named "diagnosis".
The strings are located in the column named "GROUPER_NAME".
df <- structure(list(GROUPER_ID = structure(c("9001742130", "9001742138",
"9001742058", "9001742062", "9001742102", "9001742247", "9001742055",
"9001742158", "9001742036", "9001742053"), label = "GROUPER_ID", format.sas = "$"),
GROUPER_NAME = structure(c("EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
"EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS",
"EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS",
"EDG ICD HCUP CCS 62 (PREDICTIVE MODELS-VERSION 1.0)-COAGULATION AND HEMORRHAGIC DISORDERS",
"EDG ICD HCUP CCS 102 (PREDICTIVE MODELS-VERSION 1.0)-NONSPECIFIC CHEST PAIN",
"EDG ICD HCUP CCS 247 (PREDICTIVE MODELS-VERSION 1.0)-LYMPHADENITIS",
"EDG ICD HCUP CCS 55 (PREDICTIVE MODELS-VERSION 1.0)-FLUID AND ELECTROLYTE DISORDERS",
"EDG ICD HCUP CCS 158 (PREDICTIVE MODELS-VERSION 1.0)-CHRONIC KIDNEY DISEASE",
"EDG ICD HCUP CCS 36 (PREDICTIVE MODELS-VERSION 1.0)-CANCER OF THYROID",
"EDG ICD HCUP CCS 53 (PREDICTIVE MODELS-VERSION 1.0)-DISORDERS OF LIPID METABOLISM"
), label = "GROUPER_NAME", format.sas = "$")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
For the first example, I'd like to pull "130" and "PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE" and put them in columns "category" and "diagnosis", respectively. I've tried to alter some of the solutions on here to fit my scenario, but I'm really awful with regular expressions and cannot get anything to work. Any help would be greatly appreciated!
Upvotes: 3
Views: 83
Reputation: 79311
You could use parse_number() from readr to extract the numbers, and sub() to get the part after the last "-":
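library(dplyr)  # mutate() with .before / .keep needs dplyr >= 1.0.0
library(readr)  # parse_number()

df %>%
  # parse_number() returns the first number found in each string
  mutate(category = parse_number(GROUPER_NAME), .before = GROUPER_NAME) %>%
  # drop everything up to the last "-"; .keep = "unused" then removes GROUPER_NAME
  mutate(diagnosis = sub(".*-", "", GROUPER_NAME), .keep = "unused")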
Output:
   GROUPER_ID category diagnosis
   <chr>         <dbl> <chr>
 1 9001742130      130 PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
 2 9001742138      138 ESOPHAGEAL DISORDERS
 3 9001742058       58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
 4 9001742062       62 COAGULATION AND HEMORRHAGIC DISORDERS
 5 9001742102      102 NONSPECIFIC CHEST PAIN
 6 9001742247      247 LYMPHADENITIS
 7 9001742055       55 FLUID AND ELECTROLYTE DISORDERS
 8 9001742158      158 CHRONIC KIDNEY DISEASE
 9 9001742036       36 CANCER OF THYROID
10 9001742053       53 DISORDERS OF LIPID METABOLISM
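One caveat with sub(".*-", "", ...): the ".*" is greedy, so it strips everything up to the last "-" in the string. That is fine for the sample data, but a diagnosis that itself contains a hyphen would be truncated. A minimal sketch of the difference, assuming the code is always followed by a single parenthesised block and then "-"; the second string below is made up for illustration and is not from the question's data:

```r
# The second string is hypothetical: it has a hyphen inside the diagnosis
x <- c(
  "EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS",
  "EDG ICD HCUP CCS 999 (PREDICTIVE MODELS-VERSION 1.0)-NON-SPECIFIC EXAMPLE"
)

# Greedy ".*-" removes everything up to the LAST hyphen
greedy <- sub(".*-", "", x)

# Anchored variant: remove everything up to the first ")" and the "-" right after it
anchored <- sub("^[^)]*\\)-", "", x)

print(greedy)    # second element loses its "NON-" prefix
print(anchored)  # second element keeps the full diagnosis
```

If the real data never has hyphens in the diagnosis, the original pattern is enough; the anchored one is only a bit more defensive.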
Upvotes: 3
Reputation: 887971
We can use sub from base R. Capture the digits (\\d+) after the prefix substring, and the characters after the ) and -. In the replacement, specify the backreferences (\\1, \\2) of the captured groups, and read them into a two-column data.frame with read.csv:
read.csv(text = sub("\\w+ \\w+ \\w+ \\w+ (\\d+)\\s.*\\)-(.*)",
"\\1:\\2", df$GROUPER_NAME), sep = ":", header = FALSE,
col.names = c("category", "diagnosis"))
-output
   category                                             diagnosis
1       130            PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
2       138                                  ESOPHAGEAL DISORDERS
3        58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
4        62                 COAGULATION AND HEMORRHAGIC DISORDERS
5       102                                NONSPECIFIC CHEST PAIN
6       247                                         LYMPHADENITIS
7        55                       FLUID AND ELECTROLYTE DISORDERS
8       158                                CHRONIC KIDNEY DISEASE
9        36                                     CANCER OF THYROID
10       53                         DISORDERS OF LIPID METABOLISM
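The round trip through read.csv relies on ":" never appearing in a diagnosis; a row containing one would likely be split into extra fields. As a possible alternative, base R's strcapture() (in utils, R >= 3.4) returns the captured groups as a data.frame directly. A minimal sketch, assuming every string has the code followed by one parenthesised block and then "-"; x is a two-row stand-in for df$GROUPER_NAME:

```r
# Two rows taken from the question's data, standing in for df$GROUPER_NAME
x <- c(
  "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
  "EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS"
)

# Group 1: the digits before " ("; group 2: everything after ")-"
# proto gives the column names and types of the result
out <- strcapture(
  "(\\d+) \\([^)]*\\)-(.*)",
  x,
  proto = data.frame(category = integer(), diagnosis = character(),
                     stringsAsFactors = FALSE)
)

print(out)
```

Strings that do not match the pattern come back as NA rows rather than raising an error, so it is worth checking anyNA(out) on the full data.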
Upvotes: 5