Reputation: 135
I am setting up an automated data analysis procedure and, more or less at the end of the procedure, I would like to extract automatically the name of the file that has been analysed. I have a data frame with a column containing names, with the following style:
I would like to extract just the characters in bold: "KB_1813_B", "KB1720_1" and "KB1810 mat" in a separate column.
I used gsub with the following command:
df$column.with.names <- gsub(".*KB|_.*", "KB", df$column.with.new.names)
I could easily remove the first part of the problem, but I am stuck trying to remove the second part. Is there some command in gsub to remove everything, starting from the end of the name, until you encounter a special character ( "_" in my case)?
Thank you :)
Upvotes: 1
Views: 472
Reputation: 9705
The way to do this is using regex groups:
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010")
gsub("^.+Cell_Line_._(.+)_.+$", "\\1", x)
[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
Upvotes: 2
Reputation: 887851
We can use str_extract
library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
Or the same pattern can be captured as a group with sub
sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010"), stringsAsFactors = FALSE)
Upvotes: 2