vivek

Reputation: 311

Extracting specific data from text column in R

I have a data set with medicine names in one column. I am trying to extract the name, strength, and unit of each medicine from it. The terms MG and ML mark the unit of the strength. For example, consider the following medicine names:

 Medicine name
----------------------
 FALCAN 150 MG tab
 AUGMENTIN 500MG tab
 PRE-13 0.5 ML PFS inj
 NS.9%w/v 250 ML, Glass Bottle

I want to extract the following columns from this data set:

Name     | Strength |Unit
---------| ---------|------
FALCAN   | 150      |MG
AUGMENTIN| 500      |MG
PRE-13   | 0.5      |ML
NS.9%w/v | 250      |ML

I have tried grepl and similar commands but could not find a good solution. I have more than 12,000 entries to process. The data does not follow a fixed pattern, and in a few places there is no space between the strength and MG, as in 300MG.

Upvotes: 0

Views: 2917

Answers (3)

Carl Boneri

Reputation: 2722

A <- trimws(strsplit('FALCAN 150 MG tab
 AUGMENTIN 500MG tab
PRE-13 0.5 ML PFS inj
NS.9%w/v 250 ML, Glass Bottle',"\n")[[1]])

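# Split each entry on spaces and build one data-frame row per medicine.
plyr::ldply(strsplit(A, " "), function(i){
    # Drop trailing punctuation such as the comma in "ML,".
    new <- gsub("[[:punct:]]$", "", i)
    # The unit is the two-capital-letter token, optionally glued to digits (e.g. "500MG").
    Unit <- gsub("[0-9]", "", new[grep("^([0-9]{1,})?[A-Z]{2}$", new)])
    data.frame(
        # Assumes the name is the first token and the strength the second.
        Name = i[[1]], Strength = gsub("[A-Za-z]", "", i[[2]]), Unit = Unit,
        stringsAsFactors = FALSE
    )
})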

       Name Strength Unit
1    FALCAN      150   MG
2 AUGMENTIN      500   MG
3    PRE-13      0.5   ML
4  NS.9%w/v      250   ML

Upvotes: 0

Wietze314

Reputation: 6020

You can achieve this with multiple regular expressions. Although I am not a regex expert, I use them for the same purpose you describe here.

meds <- c('FALCAN 150 MG tab',
'AUGMENTIN 500MG tab',
'PRE-13 0.5 ML PFS inj',
'NS.9%w/v 250 ML, Glass Bottle')

library(stringr)

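# Name: everything before the last space that is followed by a number
trimws(str_extract(str_extract(meds, '.* [0-9.]+'),'.* '))

# Strength: the number directly before MG/ML, with or without a space
# ('+' rather than '{3}' so that strengths of any length, e.g. 50 or 1000, match)
str_extract(str_extract(meds, '[0-9.]+( M|M)[GL]'),'[0-9.]*')

# Unit: MG or ML preceded by a space or a digit
str_extract(str_extract(meds, '( M|[0-9]M)[GL]'), 'M[GL]')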

Medicine notations vary widely, so I prefer to extract each item with its own regular expression, in contrast to the solution by G. Grothendieck, which expects a fixed structure in the data (3 columns). That way I can tweak each pattern by inspecting all the strings that produce NA values.
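The inspect-the-NA-values workflow can be sketched roughly as follows, using only base R. This is a minimal sketch, not part of the code above: the helper name extract_med, the combined pattern, and the extra 'PARACETAMOL syrup' entry are assumptions made for illustration.

```r
meds <- c('FALCAN 150 MG tab',
          'AUGMENTIN 500MG tab',
          'PRE-13 0.5 ML PFS inj',
          'NS.9%w/v 250 ML, Glass Bottle',
          'PARACETAMOL syrup')  # hypothetical entry without a strength

# Hypothetical helper: one pattern with three capture groups
# (name, numeric strength, unit). Non-matching rows become NA.
extract_med <- function(x) {
  pattern <- "^\\s*(.*?)\\s+([0-9]*\\.?[0-9]+)\\s*(MG|ML)\\b"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each match holds the full match plus 3 groups; no match gives character(0).
  parts <- t(vapply(m, function(p) {
    if (length(p) == 4) p[2:4] else rep(NA_character_, 3)
  }, character(3)))
  data.frame(Name = parts[, 1],
             Strength = as.numeric(parts[, 2]),
             Unit = parts[, 3],
             stringsAsFactors = FALSE)
}

res <- extract_med(meds)
print(res)

# Strings that failed to parse; inspect these and adjust the pattern.
unparsed <- meds[is.na(res$Strength)]
print(unparsed)
```

Keeping the unparsed strings in their own vector makes it easy to refine the pattern step by step on a large data set.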

Upvotes: 0

G. Grothendieck

Reputation: 269346

Assuming the input L is as shown in the Note at the end, use sub to replace MG or ML and everything after it with a space followed by MG or ML, and then read the result with read.table:

s <- sub("(M[GL]).*", " \\1", L)
read.table(text = s, as.is = TRUE, skip = 1, col.names = c("Name", "Strength", "Unit"))

giving:

       Name Strength Unit
1    FALCAN    150.0   MG
2 AUGMENTIN    500.0   MG
3    PRE-13      0.5   ML
4  NS.9%w/v    250.0   ML

Note: The input L used is:

L <- c("Medicine name", " FALCAN 150 MG tab", " AUGMENTIN 500MG tab", 
" PRE-13 0.5 ML PFS inj", " NS.9%w/v 250 ML, Glass Bottle")
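
To see why read.table can split the result into three columns, it may help to look at the intermediate vector s. The following is a small sketch under the assumption that every data line has the form name, strength, unit; the field-count check and the names n_fields and bad_rows are additions for illustration, useful for spotting lines in a larger data set that would not parse.

```r
L <- c("Medicine name", " FALCAN 150 MG tab", " AUGMENTIN 500MG tab",
       " PRE-13 0.5 ML PFS inj", " NS.9%w/v 250 ML, Glass Bottle")

# Replace the first MG/ML and everything after it with " MG" / " ML".
s <- sub("(M[GL]).*", " \\1", L)
print(s)

# Count whitespace-separated fields per data line (header dropped).
n_fields <- lengths(strsplit(trimws(s[-1]), "\\s+"))

# Lines that do not have exactly three fields need separate handling
# before read.table is called with three column names.
bad_rows <- L[-1][n_fields != 3]
print(bad_rows)
```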

Upvotes: 2
