Extract text with gsub

Question

I am setting up an automated data analysis procedure and, more or less at the end of the procedure, I would like to extract automatically the name of the file that has been analysed. I have a data frame with a column containing names, with the following style:

Baseline/Cell_Line_2_KB_1813_B_Baseline
Dose 0001/Cell_Line_3_KB1720_1_0001
Dose 0010/Cell_Line_1_KB1810 mat_0010

I would like to extract just the identifier part of each name, "KB_1813_B", "KB1720_1" and "KB1810 mat", into a separate column.

I used gsub with the following command:

df$column.with.names <- gsub(".*KB|_.*", "KB", df$column.with.new.names)

I could easily remove the first part of the name, but I am stuck trying to remove the second part. Is there a pattern in gsub that removes everything from the end of the name back to a special character ("_" in my case)?

Thank you :)

akrun · Accepted Answer

We can use str_extract

library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B"  "KB1720_1"   "KB1810 mat"

Or the same pattern can be captured as a group with sub

sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B"  "KB1720_1"   "KB1810 mat"
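
If loading stringr is not an option, a base R sketch of the same extraction could combine regexpr and regmatches. The vector x below simply repeats the sample names from the question and is an illustrative name. Note that regmatches drops elements with no match, so the result can be shorter than the input if some names lack a KB identifier.

```r
# Sample names copied from the question (x is an illustrative name)
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
       "Dose 0001/Cell_Line_3_KB1720_1_0001",
       "Dose 0010/Cell_Line_1_KB1810 mat_0010")

# "KB", optional underscores, digits, optional "_" or space separators,
# then everything up to (not including) the next underscore
pattern <- "KB_*\\d+[_ ]*[^_]*"

# regexpr() locates the first match in each string; regmatches() extracts it
ids <- regmatches(x, regexpr(pattern, x))
ids
```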

data

df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
                                           "Dose 0001/Cell_Line_3_KB1720_1_0001",
                                           "Dose 0010/Cell_Line_1_KB1810 mat_0010"),
                 stringsAsFactors = FALSE)
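
To answer the original question more literally, the pattern "_[^_]*$" matches the last underscore and everything after it, so it can strip the trailing label. The following is a minimal sketch, not part of the accepted answer, and it assumes that every identifier starts with "KB" preceded by an underscore and that the trailing label contains no underscore. The column name id is hypothetical.

```r
# Sample data as in the question
df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
                                           "Dose 0001/Cell_Line_3_KB1720_1_0001",
                                           "Dose 0010/Cell_Line_1_KB1810 mat_0010"),
                 stringsAsFactors = FALSE)

# Step 1: remove everything up to the underscore in front of "KB", keeping "KB"
no_prefix <- sub("^.*_(KB)", "\\1", df$column.with.new.names)

# Step 2: remove the last underscore and everything after it
df$id <- sub("_[^_]*$", "", no_prefix)  # 'id' is a hypothetical column name
df$id
```

Splitting the work into two sub calls keeps each pattern simple. The single extraction pattern from the answer is more compact but ties the match to digits following "KB".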
