Mollan

Reputation: 135

Extract text with gsub

I am setting up an automated data analysis procedure and, near the end of it, I would like to automatically extract the name of the file that has been analysed. I have a data frame with a column containing names in the following style:

Baseline/Cell_Line_2_KB_1813_B_Baseline
Dose 0001/Cell_Line_3_KB1720_1_0001
Dose 0010/Cell_Line_1_KB1810 mat_0010

I would like to extract just the sample names, "KB_1813_B", "KB1720_1" and "KB1810 mat", into a separate column.

I used gsub with the following command:

df$column.with.names <- gsub(".*KB|_.*", "KB", df$column.with.new.names)

I could easily remove the first part of the string, but I am stuck trying to remove the trailing part. Is there a way in gsub to remove everything from the end of the string back to the last occurrence of a special character ("_" in my case)?

Thank you :)

Upvotes: 1

Views: 472

Answers (2)

thc

Reputation: 9705

One way to do this is with a regex capture group:

x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
       "Dose 0001/Cell_Line_3_KB1720_1_0001",
       "Dose 0010/Cell_Line_1_KB1810 mat_0010")

# Keep only the group between "Cell_Line_<one character>_" and the last underscore
gsub("^.+Cell_Line_._(.+)_.+$", "\\1", x)
#[1] "KB_1813_B"  "KB1720_1"   "KB1810 mat"
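
On the narrower question of stripping everything from the end back to the last underscore, here is a minimal two-step sketch. It assumes each name contains "KB" and that the unwanted suffix never contains an underscore itself. The variable names `trimmed` and `result` are only illustrative.

```r
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
       "Dose 0001/Cell_Line_3_KB1720_1_0001",
       "Dose 0010/Cell_Line_1_KB1810 mat_0010")

# Step 1: drop everything before the first "KB".
# The lazy ".*?" is used with perl = TRUE so it stops at the first "KB".
trimmed <- sub("^.*?KB", "KB", x, perl = TRUE)

# Step 2: drop the last underscore and everything after it.
# "[^_]*$" can only match underscore-free text at the end of the string,
# so the "_" in front of it is necessarily the last underscore.
result <- sub("_[^_]*$", "", trimmed)

print(result)
```

Since each string needs only one replacement per step, `sub` is enough here; `gsub` would behave the same with these anchored patterns.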

Upvotes: 2

akrun

Reputation: 887851

We can use str_extract from the stringr package:

library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B"  "KB1720_1"   "KB1810 mat"

Or the same pattern can be captured as a group with sub in base R:

sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B"  "KB1720_1"   "KB1810 mat"
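
To get the result into a separate column, as the question asks, something like the following may work. It is only a sketch: the column name `sample.name` and the fourth, non-matching row are made up for illustration. Because `sub` returns its input unchanged when the pattern does not match, the sketch checks with `grepl` first so that such rows end up as `NA` instead of silently keeping the full path.

```r
# Sample data in the style of the question, plus one hypothetical
# row that contains no KB-style name.
df <- data.frame(
  column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
                            "Dose 0001/Cell_Line_3_KB1720_1_0001",
                            "Dose 0010/Cell_Line_1_KB1810 mat_0010",
                            "Blank/no_sample_here"),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer above.
pattern <- "KB_*\\d+[_ ]*[^_]*"

# Rows where the pattern occurs at all.
hit <- grepl(pattern, df$column.with.new.names)

# Start with NA everywhere, then fill in only the matching rows.
df$sample.name <- NA_character_
df$sample.name[hit] <- sub(paste0(".*(", pattern, ").*"), "\\1",
                           df$column.with.new.names[hit])

print(df$sample.name)
```

Writing to a new column leaves the original names untouched, which makes it easy to spot-check the extraction afterwards.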

data

df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
                                           "Dose 0001/Cell_Line_3_KB1720_1_0001",
                                           "Dose 0010/Cell_Line_1_KB1810 mat_0010"),
                 stringsAsFactors = FALSE)

Upvotes: 2
