Reputation: 33
Thanks in advance for any help, and apologies for not being able to figure this out from other examples.
I have a vector containing names of files such as: vec = c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext", "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")
I want to create two separate additional vectors by extracting:
1. The key letter (either L or R) between the numbers that go side-by-side, which vary from case to case. e.g., result: L,R,L,R
2. The "set" string, plus the number--which varies across cases--attached to it between brackets, with and without the brackets. e.g., result1: (set1), (set19), (set94), (set39); result2: set1, set19, set94, set39
Ideally using stringr, but I'm open to other (simpler?) libraries/functions.
For case 1, I tried str_extract(vec, "(?<= \\)_)[0-9]*") as a way to match the ")_" pattern followed by a number [0-9], but all I get in return are NAs (I think I'm not escaping the ")" correctly).
For case 2, I had to make do by extracting just the set numbers with str_extract(vec, "(?<=set)[0-9]*") and then creating another variable by pasting the word "set" back on; obviously not ideal with large data frames.
Upvotes: 1
Views: 148
Reputation: 5398
There is a newer tidyr function, separate_wider_delim(), that works well in this situation. I also used str_extract() to pull out the letter:
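library(dplyr); library(tidyr); library(stringr)
df <- data.frame(vec = vec)  # assumes vec is the character vector from the question
df %>% separate_wider_delim(
    vec,
    delim = "_",
    names = c(NA, "ImgNum", "SetGroup", "ID", NA)) %>%  # NA drops that piece
  mutate(LR = str_extract(ID, pattern = "[LR]"))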
# ImgNum SetGroup ID LR
#1 1 (set1) 2L4 L
#2 37 (set19) 2R4 R
#3 187 (set94) 4L4 L
#4 77 (set39) 4R2 R
Upvotes: 0
Reputation: 146070
The set pattern is nice and easy: the letters "set" followed by one or more digits, "[0-9]+".
At least for your examples, it seems the letters L and R don't show up anywhere else, so we can use a very simple pattern for them too, just look for an L or an R: "L|R".
set = str_extract(vec, pattern = "set[0-9]+")
main = str_extract(vec, pattern = "L|R")
set
# [1] "set1" "set19" "set94" "set39"
main
# [1] "L" "R" "L" "R"
If you're worried about false hits on the L or R because they might show up elsewhere in the input, you can make the pattern more specific, for example by looking behind for a digit with "(?<=[0-9])" and looking ahead for a digit with "(?=[0-9])". Put the character class [LR] (or a grouped (L|R)) between them; a bare L|R would split the pattern in two, so that each lookaround applies to only one of the letters:
main2 = str_extract(vec, pattern = "(?<=[0-9])[LR](?=[0-9])")
main2
# [1] "L" "R" "L" "R"
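To see what the stricter pattern buys you, here is a small sketch on made-up file names (`tricky` is hypothetical, not from the question) where a stray L or R appears before the one you want. It assumes stringr is installed.

```r
library(stringr)

# Hypothetical file names with an extra R / L at the start
tricky <- c("Right_1_(set1)_2L4_s.ext", "Left_37_(set19)_2R4_s.ext")

# Plain alternation returns the first L or R anywhere in the string
naive <- str_extract(tricky, "L|R")

# Requiring a digit on both sides skips the stray leading letters
strict <- str_extract(tricky, "(?<=[0-9])[LR](?=[0-9])")

naive
strict
```

On these inputs the simple pattern picks up the leading letter, while the lookaround version returns the letter sitting between the two digits.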
And if you do want the parens with the set, escape them so they are matched literally:
set_with_paren = str_extract(vec, pattern = "\\(set[0-9]+\\)")
set_with_paren
# [1] "(set1)" "(set19)" "(set94)" "(set39)"
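If you want both versions from a single pass, one option is str_match(), which returns the full match plus each capture group as columns of a character matrix. This is only a sketch; the names `m` and `set_no_paren` are mine.

```r
library(stringr)

vec <- c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext",
         "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

# Column 1 is the full match (with parens); column 2 is the capture group
m <- str_match(vec, "\\((set[0-9]+)\\)")
set_with_paren <- m[, 1]
set_no_paren   <- m[, 2]

set_with_paren
set_no_paren
```

That avoids extracting the number and pasting "set" back on afterwards.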
Upvotes: 0
Reputation: 160687
I think we can use strcapture for this, returning a two-column data.frame.
strcapture(".*(set[0-9]+).*[0-9]([LR])[0-9].*", vec, proto=list(result1="", lr=""))
# result1 lr
# 1 set1 L
# 2 set19 R
# 3 set94 L
# 4 set39 R
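As an aside, the proto argument also controls the column types, so if you only need the set number you can capture just the digits and have them converted to integer. A minimal sketch, with `parsed` and `set_num` as names I made up:

```r
vec <- c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext",
         "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

# Capture only the digits after "set"; each prototype column's type
# decides how the captured strings are converted
parsed <- strcapture(".*set([0-9]+).*[0-9]([LR])[0-9].*", vec,
                     proto = data.frame(set_num = integer(), lr = character()))
str(parsed)
```

Here set_num comes back as an integer column and lr as character.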
You said "with and without the brackets", so perhaps
strcapture(".*(\\(set[0-9]+\\)).*[0-9]([LR])[0-9].*", vec, proto=list(result1="", lr="")) |>
transform(result2 = gsub("[()]", "", result1))
# result1 lr result2
# 1 (set1) L set1
# 2 (set19) R set19
# 3 (set94) L set94
# 4 (set39) R set39
If your vec is a column in a data.frame and you're otherwise using mutate, we can use it fairly conveniently as
library(dplyr)
dat <- data.frame(something = vec)
dat |>
  mutate(strcapture(".*(set[0-9]+).*[0-9]([LR])[0-9].*", something, proto=list(result1="", lr="")))
# something result1 lr
# 1 Img_1_(set1)_2L4_s.ext set1 L
# 2 Img_37_(set19)_2R4_s.ext set19 R
# 3 Img_187_(set94)_4L4_s.ext set94 L
# 4 Img_77_(set39)_4R2_s.ext set39 R
(Noting that we don't name the return value, since mutate accepts an unnamed data.frame and just cbinds its columns onto the result.)
Upvotes: 3