Svenja
Svenja

Reputation: 119

Extract numeric and character values from a character variable in data.table R

I have the following data.table

df <- data.table(id=c(1,2,3,4),
                 medication=c("Abc de 3 MG", "Afg frt re 4 MG/ML","Agh","Aj yr 5 MG"))

with

id         medication
1:  1        Abc de 3 MG
2:  2 Afg frt re 4 MG/ML
3:  3                Agh
4:  4         Aj yr 5 MG

I want to extract the doses from the medication, and create a column called doses

id medication   doses
1:  1     Abc de    3 MG
2:  2 Afg frt re 4 MG/ML
3:  3        Agh    <NA>
4:  4      Aj yr    5 MG

It should contain the number and unit. Not every medication has a number and unit which should be included as NA.

I looked at the tidyverse extract function but could not find something to extract numeric and character values. I am using data.table with a large dataset. A time efficient function is great.

Upvotes: 1

Views: 498

Answers (4)

Mike V
Mike V

Reputation: 1364

Though this method is not data.table, but you can take it into account

library(tidyr)
df %>% 
  separate(medication, into = c("medication", "doses"), sep = "(?=\\d)")
# id  medication   doses
# 1  1     Abc de     3 MG
# 2  2 Afg frt re  4 MG/ML
# 3  3         Agh    <NA>
# 4  4      Aj yr     5 MG

Upvotes: 1

akrun
akrun

Reputation: 887251

An option with extract from tidyr

library(tidyr)
extract(df, medication, into = c('medication', 'doses'), '(.*)\\s+(\\d+\\s+\\D+)$')
#   id medication   doses
#1:  1     Abc de    3 MG
#2:  2 Afg frt re 4 MG/ML
#3:  3       <NA>    <NA>
#4:  4      Aj yr    5 MG

Upvotes: 0

s_baldur
s_baldur

Reputation: 33488

Insert an @ (or any other character that is not in your column already) ahead of the first number, then use that to split the column into two:

df[, c("medication", "doses") := tstrsplit(sub("([0-9])", "@\\1", medication), "@")]
df

#    id  medication   doses
# 1:  1     Abc de     3 MG
# 2:  2 Afg frt re  4 MG/ML
# 3:  3         Agh    <NA>
# 4:  4      Aj yr     5 MG

EDIT

A cleanr solution is using slightly more advanced regex (positive lookahead), just need to remember perl = TRUE:

df[, c("medication", "doses") := tstrsplit(medication, ".(?=[0-9])", perl = TRUE)]

Upvotes: 2

ThomasIsCoding
ThomasIsCoding

Reputation: 101818

Maybe you can try strsplit like below

df[-1] <- do.call(rbind,lapply(strsplit(df$medication,"(?<=[A-Za-z])\\s(?=[0-9])",perl = TRUE),`length<-`,2))

which gives

> df
  id medication.1 medication.2
1  1       Abc de         3 MG
2  2   Afg frt re      4 MG/ML
3  3          Agh         <NA>
4  4        Aj yr         5 MG

Upvotes: 0

Related Questions