Reputation: 667
Assume I have the following string (filename):
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
which is one of several parts (here, part p1)
or another one
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"
which consists of only one part (so no need to label any p)
How can I extract the identifier, which is the three letters before the VARXXXXX (so in case one it would be TKN, in case two it would be ZHN), PLUS the part identifier, if available?
So the result should be:
case1 : TKN_p1
case2 : ZHN
I know how to extract the first identifier, but I cannot handle the second one at the same time.
My approach so far:
sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", a)
sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", b)
but this incorrectly adds .tx in the second case.
Upvotes: 1
Views: 74
Reputation: 5580
Just another solution, for something different from Wiktor's already working solution:
library( magrittr )
data <- c( a, b )
First get the "ID" values by splitting on "/", taking the last value, and taking the first 3 characters of that:
ID <- strsplit( data, "/" ) %>%
  sapply( tail, n = 1 ) %>%
  substr( 1, 3 )
Then get the "part" values by splitting out both "timely" and ".txt", and taking the last element (which may be an empty string):
# escape the dot so that only a literal ".txt" is used as a separator
part <- strsplit( data, "timely|\\.txt" ) %>%
  sapply( tail, n = 1 )
Now just paste them together for the result:
output <- paste0( ID, part )
output
[1] "TKN_p1" "ZHN"
Or, if you'd rather not create the intermediate objects:
output <- strsplit( data, "/" ) %>%
  sapply( tail, n = 1 ) %>%
  substr( 1, 3 ) %>%
  paste0( strsplit( data, "timely|\\.txt" ) %>%  # dot escaped to match a literal "."
            sapply( tail, n = 1 ) )
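The same split-and-paste idea can also be sketched without magrittr. This is only a minimal variant, assuming every file name starts with the three-letter ID and contains "timely" right before the optional part label; the variable names file and stem are just illustrative.

```r
data <- c("X/ZHEB100/TKN_VAR29380_timely_p1.txt",
          "X/ZHEB100/ZHN_VAR29380_timely.txt")

# keep only the file name, dropping the directory part
file <- basename(data)

# the ID is assumed to be the first three characters of the file name
ID <- substr(file, 1, 3)

# drop the extension, then everything up to and including "timely";
# what is left is the part label, or an empty string if there is none
stem <- tools::file_path_sans_ext(file)
part <- sub("^.*timely", "", stem)

output <- paste0(ID, part)
output
```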
Upvotes: 1
Reputation: 626699
You are not using anchors, and you are matching the 3 characters right after timely without checking what these characters are (. matches any character).
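As a quick illustration of the problem, here is the pattern from the question applied to both strings (res_a and res_b are just illustrative names). In the second string the three characters after timely are ".tx", so they end up in Group 2.

```r
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"

# pattern from the question: (.{3}) after "timely" accepts any 3 characters
pattern <- ".*(.{3})_VAR29380_timely(.{3}).*"

res_a <- sub(pattern, "\\1\\2", a)  # Group 2 captures "_p1"
res_b <- sub(pattern, "\\1\\2", b)  # Group 2 captures ".tx"

res_a
res_b
```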
I suggest
sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
Details:
^ - start of string
.*/ - part of string up to and including the last /
([A-Z]{3}) - 3 ASCII uppercase letters captured into Group 1
_VAR\\d+_timely - _VAR + 1 or more digits + _timely
(_[^_.]+)? - an optional Group 2 capturing _ + 1 or more chars other than _ and .
\\. - a dot
[^.]* - zero or more chars other than .
$ - end of string.
The replacement pattern contains 2 backreferences, one to each capturing group, which insert their contents into the replaced string.
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
a2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
a2
[1] "TKN_p1"
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"
b2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", b)
b2
[1] "ZHN"
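Since sub() is vectorized, both strings can be processed in one call. Note that sub() returns the input unchanged when the pattern does not match, so a small wrapper can be useful. The following is only a sketch: extract_id is a hypothetical helper name, and returning NA for non-matching names is an assumption about the desired behaviour.

```r
# hypothetical helper: returns "ID" or "ID_part", and NA when the name does not match
extract_id <- function(paths) {
  pattern <- "^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$"
  out <- sub(pattern, "\\1\\2", paths)
  # sub() leaves non-matching strings as they are, so flag them explicitly
  out[!grepl(pattern, paths)] <- NA_character_
  out
}

paths <- c("X/ZHEB100/TKN_VAR29380_timely_p1.txt",
           "X/ZHEB100/ZHN_VAR29380_timely.txt",
           "no_match_here.txt")
ids <- extract_id(paths)
ids
```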
Upvotes: 2