user3032689

Reputation: 667

Extract 2 parts of a string

Assume I have the following string (filename):

a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"

which is one of several parts (here, part p1)

or another one

b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"

which has only one part (so there is no p label)

How can I extract the Identifier, which is the three letters before the VARXXXXX (so in case one it would be TKN, in case two it would be ZHN) PLUS the part identifier, if available?

So the result should be:

case1 : TKN_p1
case2 : ZHN

I know how to extract the first identifier, but I cannot handle the second one at the same time.

My approach so far:

sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", a)
sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", b)

but this adds .tx incorrectly in the second case.

Upvotes: 1

Views: 74

Answers (2)

rosscova

Reputation: 5580

Here is an alternative to Wiktor's working regex solution, using string splitting instead:

library( magrittr )
data <- c( a, b )

First get the "ID" values by splitting on "/", taking the last value, and taking the first 3 characters of that:

ID <- strsplit( data, "/" ) %>%
    sapply( tail, n = 1 ) %>%
    substr( 1, 3 )

Then get the "part" values by splitting on both "timely" and ".txt", and taking the last element (which may be an empty string). The dot is escaped so that it matches a literal "." rather than any character:

part <- strsplit( data, "timely|\\.txt" ) %>%
    sapply( tail, n = 1 )

Now just paste them together for the result:

output <- paste0( ID, part )
output
[1] "TKN_p1" "ZHN"

Or, if you'd rather not create the intermediate objects:

output <- strsplit( data, "/" ) %>%
    sapply( tail, n = 1 ) %>%
    substr( 1, 3 ) %>%
    paste0( strsplit( data, "timely|\\.txt" ) %>%
                      sapply( tail, n = 1 ) )
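
The same idea can also be written in base R without the pipe. This is only a minimal sketch: the function name extract_id is made up for illustration, and it assumes every file name contains "timely" directly before the optional part label.

```r
# Hypothetical helper; the name extract_id is not part of any package.
extract_id <- function(paths) {
  files <- basename(paths)                   # drop the directory part
  id    <- substr(files, 1, 3)               # first three characters of the file name
  stem  <- tools::file_path_sans_ext(files)  # drop the extension (".txt")
  part  <- sub("^.*timely", "", stem)        # keep whatever follows "timely" (may be "")
  paste0(id, part)
}

extract_id(c("X/ZHEB100/TKN_VAR29380_timely_p1.txt",
             "X/ZHEB100/ZHN_VAR29380_timely.txt"))
# "TKN_p1" "ZHN"
```

If a name does not contain "timely", sub() leaves the stem unchanged, so the result for that element would not be meaningful.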

Upvotes: 1

Wiktor Stribiżew

Reputation: 626699

You are not using anchors, and you match the 3 characters right after timely without checking what those characters are (. matches any character).

I suggest

sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)

Details:

  • ^ - start of string
  • .*/ - part of string up to and including the last /
  • ([A-Z]{3}) - 3 ASCII uppercase letters captured into Group 1
  • _VAR\\d+_timely - _VAR + 1 or more digits + _timely
  • (_[^_.]+)? - an optional Group 2 capturing _ + 1 or more chars other than _ and .
  • \\. - a dot
  • [^.]* - zero or more chars other than .
  • $ - end of string.

The replacement pattern contains 2 backreferences, one for each capturing group, which insert the captured text into the result. If the optional Group 2 does not match, \\2 inserts an empty string.
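
To see what each backreference contributes, the two groups can be pulled out separately. This is just a small sketch reusing the pattern above; the variable names are only for illustration.

```r
pattern <- "^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$"
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"

id_a   <- sub(pattern, "\\1", a)  # Group 1 only: "TKN"
part_a <- sub(pattern, "\\2", a)  # Group 2 only: "_p1"
part_b <- sub(pattern, "\\2", b)  # Group 2 did not match, so this is ""
```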

R demo:

a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
a2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
a2
[1] "TKN_p1"
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"
b2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", b)
b2
[1] "ZHN"
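
Since sub() is vectorized, the same call also works on a whole vector of file names. One thing to keep in mind is that sub() returns the input unchanged when the pattern does not match, so a grepl() check may be useful. A minimal sketch, where the third file name is a made-up non-matching example:

```r
pattern <- "^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$"
files <- c("X/ZHEB100/TKN_VAR29380_timely_p1.txt",
           "X/ZHEB100/ZHN_VAR29380_timely.txt",
           "X/ZHEB100/readme.md")  # hypothetical name that does not fit the scheme

# Replace only where the pattern matches; mark everything else as NA
ids <- ifelse(grepl(pattern, files),
              sub(pattern, "\\1\\2", files),
              NA_character_)
ids
# "TKN_p1" "ZHN" NA
```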

Upvotes: 2
