Reputation: 67
I have a character vector, a, that contains strings like:
DEVICE PRF .75MG 0.5ML
DEVICE PRF 1.5MG 0.5MLX4
CAP 12-25MG 30
CAP DR 60MG 100UD 3270-33 (32%)
I would like to split them into three parts (or variables):
x           y        z
DEVICE PRF  .75MG    0.5ML
DEVICE PRF  1.5MG    0.5MLX4
CAP         12-25MG  30
CAP DR      60MG     100UD 3270-33 (32%)
The first part is the description, the second is the strength, and the third is the volume. I think I can use gregexpr(), but I'm not sure how to implement it. Any suggestions are appreciated. Thank you!
Upvotes: 2
Views: 33
Reputation: 48191
You could use
library(stringr)
str_match(a, "(.*)[ ]{1,}(.*(MG|ML))[ ]{1,}(.*)")[, -c(1, 4)]
#      [,1]         [,2]      [,3]
# [1,] "DEVICE PRF" ".75MG"   "0.5ML"
# [2,] "DEVICE PRF" "1.5MG"   "0.5MLX4"
# [3,] "CAP"        "12-25MG" "30"
# [4,] "CAP DR"     "60MG"    "100UD 3270-33 (32%)"
This assumes that the second (middle) part always ends with MG or ML and contains no spaces.
The pattern (.*)[ ]{1,}(.*(MG|ML))[ ]{1,}(.*) can be read as: the first part, matching anything; at least one space; the second part, ending in MG or ML; at least one space; and the third part, matching anything. The [, -c(1, 4)] index drops the full match and the inner (MG|ML) group from the result.
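Since the question mentions gregexpr(), here is a rough base R sketch of the same idea. It uses regexec() rather than gregexpr(), because regexec() also reports the capture groups. The names pattern, groups, and parts are just illustrative, and perl = TRUE (available in regexec() since R 3.3.0) is my choice here to get the usual greedy matching behaviour.

```r
a <- c("DEVICE PRF .75MG 0.5ML", "DEVICE PRF 1.5MG 0.5MLX4",
       "CAP 12-25MG 30", "CAP DR 60MG 100UD 3270-33 (32%)")

# Same pattern as above; regexec() records the position of each capture group
pattern <- "(.*)[ ]{1,}(.*(MG|ML))[ ]{1,}(.*)"
groups <- regmatches(a, regexec(pattern, a, perl = TRUE))

# Each list element is c(full match, description, strength, MG/ML, volume);
# keep elements 2, 3 and 5 and bind the rows into a character matrix
parts <- t(sapply(groups, function(g) g[c(2, 3, 5)]))
colnames(parts) <- c("x", "y", "z")
parts
```

A string that does not match the pattern yields an empty element in groups, which ends up as a row of NA in parts, so that is a quick way to spot inputs that break the assumption.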
Upvotes: 1
Reputation: 5191
Assuming that the middle part has no spaces and always starts with a period or a digit, we can do this in base R:
a <- c("DEVICE PRF .75MG 0.5ML", "DEVICE PRF 1.5MG 0.5MLX4",
       "CAP 12-25MG 30", "CAP DR 60MG 100UD 3270-33 (32%)")
# Put commas between the three captured parts, then parse the result as CSV
a_as_csv <- sub('([^.0-9]*) ([.0-9][^ ]+) (.*)', '\\1,\\2,\\3', a)
read.csv(textConnection(a_as_csv), col.names = c('x', 'y', 'z'), header = FALSE)
#            x       y                   z
# 1 DEVICE PRF   .75MG               0.5ML
# 2 DEVICE PRF   1.5MG             0.5MLX4
# 3        CAP 12-25MG                  30
# 4     CAP DR    60MG 100UD 3270-33 (32%)
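One caveat with the CSV round trip: a comma anywhere in an input string would add an extra field and shift the columns. A possible variation, sketched below, pulls each capture group out with its own sub() call and builds the data frame directly. The ^ and $ anchors and the names pattern and result are my additions, not part of the approach above.

```r
a <- c("DEVICE PRF .75MG 0.5ML", "DEVICE PRF 1.5MG 0.5MLX4",
       "CAP 12-25MG 30", "CAP DR 60MG 100UD 3270-33 (32%)")

# Same three groups as above, anchored so the pattern covers the whole string
pattern <- "^([^.0-9]*) ([.0-9][^ ]+) (.*)$"

# Replace each string with one capture group at a time; no CSV parsing involved
result <- data.frame(
  x = sub(pattern, "\\1", a),
  y = sub(pattern, "\\2", a),
  z = sub(pattern, "\\3", a),
  stringsAsFactors = FALSE
)
result
```

Note that sub() returns a string unchanged when the pattern does not match, so a non-matching input would show up as the full original string in all three columns.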
Upvotes: 0