5ant1

Reputation: 33

Extracting alphanumeric patterns from character strings whose values vary in R

Thanks in advance for any help, and apologies for not being able to figure this out from other examples.

I have a vector containing names of files such as: vec = c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext", "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

I want to create two separate additional vectors by extracting:

1. The key letter (either L or R) between the numbers that go side-by-side, which vary from case to case. e.g., result: L,R,L,R

2. The "set" string plus the number attached to it (which varies across cases), both with and without the surrounding brackets. e.g., result1: (set1), (set19), (set94), (set39); result2: set1, set19, set94, set39

Ideally using stringr, but I'm open to other (simpler?) libraries/functions.

For case 1., I tried str_extract(vec, "(?<= \\)_)[0-9]*"), as a way to match the ")_" pattern followed by a number [0-9], but all I get in return are NAs (I suspect I'm not escaping the ")" correctly).

For case 2., I had to make do by simply extracting the set numbers with str_extract(vec, "(?<=set)[0-9]*") and creating another variable by pasting the word "set" back on; obviously not ideal with large data frames.

Upvotes: 1

Views: 148

Answers (3)

M.Viking

Reputation: 5398

tidyr has a newer function, separate_wider_delim(), that works well in this situation. I also used str_extract() from stringr to pull out the letter.

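library(tidyverse)  # loads tidyr, dplyr and stringr
df <- tibble(vec = vec)  # assumes the file names sit in a column called vec

df %>% separate_wider_delim(
  vec,
  delim = "_",
  names = c(NA, "ImgNum", "SetGroup", "ID", NA)) %>% 
  mutate(LR = str_extract(ID, pattern = "[LR]"))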

#  ImgNum SetGroup ID    LR   
#1 1      (set1)   2L4   L    
#2 37     (set19)  2R4   R    
#3 187    (set94)  4L4   L    
#4 77     (set39)  4R2   R  
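
For the bracket-free version the question also asks for, here is a minimal base R sketch of the same splitting idea. It assumes every file name breaks into exactly five "_"-separated pieces, as in the examples, and the variable names are only illustrative.

```r
vec <- c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext",
         "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

# Split on the literal "_" and stack the pieces into a 4 x 5 character matrix
# (assumption: five pieces per name, as in the examples)
parts <- do.call(rbind, strsplit(vec, "_", fixed = TRUE))

set_with_brackets <- parts[, 3]                           # third piece, e.g. "(set1)"
set_no_brackets   <- gsub("[()]", "", set_with_brackets)  # drop "(" and ")"
lr                <- gsub("[0-9]", "", parts[, 4])        # strip digits from e.g. "2L4"
```

If the number of pieces can vary between names, the rbind step would not produce a clean matrix, so a regex-based approach is probably the safer choice in that case.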

Upvotes: 0

Gregor Thomas

Reputation: 146070

The set pattern is nice and easy: the letters "set" followed by one or more digits, "[0-9]+".

At least for your examples, it seems like the letters L and R don't show up anywhere else, so we can do a very simple pattern for them too, just look for an L or an R: "L|R".

set = str_extract(vec, pattern = "set[0-9]+")
main = str_extract(vec, pattern = "L|R")
set
# [1] "set1"  "set19" "set94" "set39"
main
# [1] "L" "R" "L" "R"

If you're worried about potentially getting false hits on the L or R because they might show up elsewhere in the input, you could make the pattern more specific, for example looking behind for a number "(?<=[0-9])" and looking ahead for a number "(?=[0-9])":

main2 = str_extract(vec, pattern = "(?<=[0-9])[LR](?=[0-9])")
main2
# [1] "L" "R" "L" "R"
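
One caveat when combining lookarounds with alternatives: "|" has the lowest precedence in a regex, so a lookbehind and lookahead written around a bare L|R each attach to only one branch. A character class such as [LR] (or a group) keeps both conditions on both letters. The check below is a base R sketch (perl = TRUE enables lookarounds) on a made-up file name; extract_first() is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical file name with a stray "R2" at the start
x <- "R2_1_(set1)_2L4_s.ext"

# Return the first match of `pattern` in each element of `x`, or NA if none
extract_first <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Parsed as "(digit before L) OR (R before digit)", so the leading "R2" matches
loose  <- extract_first("(?<=[0-9])L|R(?=[0-9])", x)
# Both lookarounds apply to either letter, so only the "L" in "2L4" matches
strict <- extract_first("(?<=[0-9])[LR](?=[0-9])", x)
```

Here loose comes back as "R" and strict as "L", which is why the character-class form is the one to prefer when false hits are a concern.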

And if you do want the parens with the set, you escape parens to include them in the pattern:

set_with_paren = str_extract(vec, pattern = "\\(set[0-9]+\\)")
set_with_paren
# [1] "(set1)"  "(set19)" "(set94)" "(set39)"

Upvotes: 0

r2evans

Reputation: 160687

I think we can use strcapture for this, returning a two-column data.frame.

strcapture(".*(set[0-9]+).*[0-9]([LR])[0-9].*", vec, proto=list(result1="", lr=""))
#   result1 lr
# 1    set1  L
# 2   set19  R
# 3   set94  L
# 4   set39  R

You said "with and without the brackets", so perhaps

strcapture(".*(\\(set[0-9]+\\)).*[0-9]([LR])[0-9].*", vec, proto=list(result1="", lr="")) |>
  transform(result2 = gsub("[()]", "", result1))
#   result1 lr result2
# 1  (set1)  L    set1
# 2 (set19)  R   set19
# 3 (set94)  L   set94
# 4 (set39)  R   set39

If your vec is a column in a data.frame and you're otherwise using mutate, we can use it fairly conveniently as

library(dplyr)
dat <- data.frame(something = vec)
dat |>
  mutate(strcapture(".*(set[0-9]+).*[0-9]([LR])[0-9].*", something, proto=list(result1="", lr="")))
#                   something result1 lr
# 1    Img_1_(set1)_2L4_s.ext    set1  L
# 2  Img_37_(set19)_2R4_s.ext   set19  R
# 3 Img_187_(set94)_4L4_s.ext   set94  L
# 4  Img_77_(set39)_4R2_s.ext   set39  R

(Noting that we don't name the return value, since mutate accepts a data.frame and just cbinds the results.)
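
If some file names might not follow the pattern, strcapture should give NA in every column for those rows rather than stopping with an error, which makes the misfits easy to find afterwards. A small sketch, with "notes.txt" as a made-up non-matching name:

```r
# Second element is a hypothetical name that does not fit the pattern
x <- c("Img_1_(set1)_2L4_s.ext", "notes.txt")

res <- strcapture(".*(set[0-9]+).*[0-9]([LR])[0-9].*", x,
                  proto = list(result1 = "", lr = ""))

# Names that did not match show up as NA in the captured columns
unmatched <- x[is.na(res$result1)]
```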

Upvotes: 3
