Doctor David Anderson

Reputation: 274

Remove parts of pattern from string with gsub

I have a list of strings like this (58*5 cases omitted):

participant_01_Bullpup_1.xml
participant_01_Bullpup_2.xml
participant_01_Bullpup_3.xml
participant_01_Bullpup_4.xml
participant_01_Bullpup_5.xml
#...Through to...
participant_60_Bullpup_1.xml
participant_60_Bullpup_2.xml
participant_60_Bullpup_3.xml
participant_60_Bullpup_4.xml
participant_60_Bullpup_5.xml

I want to use gsub on these so that I end up with (example only):

01_1
60_5

Currently, my code is as follows:

fileNames <- Sys.glob("part*.csv")

for (fileName in fileNames) {
    sample <- read.csv(fileName, header = FALSE, sep = ",")
    part   <- gsub("[^0-9]+", "", substring(fileName, 5, last = 1000000L))
    print(part)
}

This results in the following strings (example):

011
605

However, I can't work out how to keep a single underscore between these strings.

Upvotes: 3

Views: 724

Answers (2)

Jota

Reputation: 17621

Here are a few more options (using akrun's str1):

gsub("[^0-9_]+|(?<=\\D)_", "", str1, perl=TRUE)
#[1] "01_1"
sub(".+?(\\d+_).+?(\\d+).+", "\\1\\2", str1, perl=TRUE)
#[1] "01_1"
sub(".+?(\\d+).+?(\\d+).+", "\\1_\\2", str1, perl=TRUE)
#[1] "01_1"
paste(strsplit(str1, "\\D+")[[1]][-1], collapse="_")
#[1] "01_1"

If your pattern really is that consistent (i.e. 12 characters before the first digits, followed by 8 characters until the next set of digits, followed by 4 more characters), then you can be explicit with your quantifiers:

sub(".{12}(\\d+_).{8}(\\d+).{4}", "\\1\\2", str1)
#[1] "01_1"

or simply access the characters using the appropriate indices:

paste0(substr(str1, 13, 15), substr(str1, 24, 24))
#[1] "01_1"
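
Since sub() is vectorized, none of these need the for loop from the question. Here is a minimal sketch, assuming a hypothetical character vector fileNames that follows the naming scheme in the question (the vector and the anchored pattern are my own illustration, not taken from the original post):

```r
# Hypothetical file names mirroring the pattern in the question
fileNames <- c("participant_01_Bullpup_1.xml",
               "participant_01_Bullpup_5.xml",
               "participant_60_Bullpup_3.xml")

# Capture the two runs of digits and rejoin them with a single underscore;
# \\D matches any non-digit, so prefix, middle and extension are all dropped
ids <- sub("^\\D*(\\d+)\\D+(\\d+)\\D*$", "\\1_\\2", fileNames, perl = TRUE)
ids
#[1] "01_1" "01_5" "60_3"
```

Because both groups are bounded by non-digits, this should also cope with a trial number of more than one digit, although that case does not occur in the question's data.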

Upvotes: 1

akrun

Reputation: 887971

Try

sub('[^0-9]+_([0-9]+_).*([0-9]+).*', '\\1\\2', str1)
#[1] "01_1"

library(stringr)
sapply(str_extract_all(str1, '\\d+'), paste, collapse='_')

data

str1 <- 'participant_01_Bullpup_1.xml'
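
If you would rather not load stringr, the same extract-then-paste idea can be sketched in base R with gregexpr() and regmatches(). This is only an illustrative variant, and the names digits and res are placeholders of mine:

```r
str1 <- "participant_01_Bullpup_1.xml"

# gregexpr() locates every run of digits; regmatches() extracts them as a list
digits <- regmatches(str1, gregexpr("[0-9]+", str1))

# Join the pieces of each element with a single underscore
res <- vapply(digits, paste, character(1), collapse = "_")
res
#[1] "01_1"
```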

Upvotes: 3
