Reputation: 1337
I have a character vector containing file paths and names formatted as follows:
files_list <- c(
"C:/User/Name/Folder/Subfolder1/Sub-subfolder/file.txt",
"C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy.txt",
"C:/User/Name/Folder/Subfolder1/Sub-subfolder/file (1).txt",
"C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
"C:/User/Name/Folder/Subfolder1/fileB.txt",
"C:/User/Name/Folder/file.C.txt",
"C:/User/Name/Folder/file-D.txt",
"C:/User/Name/Folder/file",
"C:/User/Name/Folder/file Z.txt",
"C:/User/Name/Folder/file - backup.txt"
)
Every file has a parent folder and a name. These names may include one or more periods "." and/or dashes "-". In addition, some have a "Copy" notation, number designation, and/or file extension. I want to convert the data to something that looks like this:
[1] "Sub-subfolder file txt"
[2] "Sub-subfolder file Copy txt"
[3] "Sub-subfolder file 1 txt"
[4] "Sub-subfolder file Copy 2 txt"
[5] "Subfolder1 fileB txt"
[6] "Folder file.C txt"
[7] "Folder file-D txt"
[8] "Folder file"
[9] "Folder file Z txt"
[10] "Folder file - backup txt"
This is the code that I think should do the trick:
sub(
"(^.:/)([^/.]+/)*([^/.]+/)([^/]+)(\\s-\\sCopy)?(\\s\\(([0-9]+)\\))?(\\.([^.]+))?$",
"\\3 \\4 \\5 \\7 \\9",
files_list
)
But what I get is this:
[1] "Sub-subfolder/ file.txt "
[2] "Sub-subfolder/ file - Copy.txt "
[3] "Sub-subfolder/ file (1).txt "
[4] "Sub-subfolder/ file - Copy (2).txt "
[5] "Subfolder1/ fileB.txt "
[6] "Folder/ file.C.txt "
[7] "Folder/ file-D.txt "
The slashes "/" and extra spaces I can deal with, but the "Copy" notations, number designations, and file extensions are not being set apart as I expect.
Any suggestions on how to identify the "Copy" notations, number designations, and file extensions? Or should I just identify the parent folders in one line of code and separate the rest in another line?
(Ultimately, I'm going to convert these text strings into a data frame with the folder, filename, copy designation, and extension as separate columns. I'm pretty sure I could do this with tidyr::separate, but even that requires an understanding of regex, and I want to learn how to use () and back references.)
Upvotes: 0
Views: 538
Reputation: 1337
Sorry if this isn't the best way to do this. I've realized that my question was incomplete, and I want to make the question more complete while also sharing the solution I came up with.
I want this code to handle the full range of possible name structures, so I used the following code to generate example file paths that cover all folder/name/Copy/number/extension combinations:
files.df <- expand.grid(
c("C:/"),
c("", "F1/", "F1/F2/"),
c("folder/"),
c("file"),
c("", " space", "-dash", " - spacedash", ".period", ".firstperiod.secondperiod"),
c("", 1, " 1", 10, " 10"),
c("", " - Copy"),
c("", " (1)", " (10)"),
c("", ".999", ".aaa"),
stringsAsFactors = FALSE
)
# Paste the columns together row by row (vectorized; no loop or exists() check needed)
x <- do.call(paste0, unname(as.list(files.df)))
Through a lot of trial and error using regex101 (thanks, @Onyambu!), I put together the following ridiculous regex that actually works:
sum(grepl(
"^.:/(([^/]+)(?=/)/?)*(?<=/)(([^/](?! - Copy| \\([0-9]+\\)|\\.[^/\\.]+$))+.)( - )?((?<= - )Copy(?= \\([0-9]+\\)(?=\\.[^/\\.]+$|$)|\\.[^/\\.]+$|$))?( \\()?((?<= \\()([0-9]+)\\)(?=\\.[^/\\.]+$|$))?\\.?((?<=\\.)([^/\\.]+))?$",
x,
perl = TRUE
))
[1] 1620
length(x)
[1] 1620
Unfortunately, this regex includes 10 capturing groups, and I can only backreference 9 of them (and #10 is the file extension). So I'll be using @RHertel's much more elegant solution. But if anyone sees a way to reduce the number of capturing groups, let me know!
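One possible way to get below nine capturing groups, offered as a sketch rather than a drop-in replacement for the regex above: make the helper groups non-capturing with (?:...) and use a lazy name group so the optional suffixes land in their own groups. The pattern and the names files, pattern and out are illustrative assumptions; files is a shortened copy of files_list from the question, and the pattern has not been run against all 1620 generated combinations.

```r
# Shortened copy of files_list from the question (illustrative sample).
files <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file (1).txt",
  "C:/User/Name/Folder/file.C.txt",
  "C:/User/Name/Folder/file",
  "C:/User/Name/Folder/file - backup.txt"
)

# Five capturing groups: folder, name, "Copy", number, extension.
# (?:...) groups without capturing; the lazy +? keeps the name from
# swallowing the optional suffixes that follow it.
pattern <- paste0(
  "^.*/([^/]+)/",          # group 1: parent folder (last directory in the path)
  "([^/]+?)",              # group 2: file name, as short as possible
  "(?: - (Copy))?",        # group 3: optional "Copy"
  "(?: \\(([0-9]+)\\))?",  # group 4: optional number in parentheses
  "(?:\\.([^./]+))?$"      # group 5: optional extension
)

out <- sub(pattern, "\\1 \\2 \\3 \\4 \\5", files, perl = TRUE)
out <- trimws(gsub(" +", " ", out))  # collapse the gaps left by empty groups
print(out)
```

This also suggests why the regex in the question did not split anything: its ([^/]+) group is greedy, so it consumes the whole file name and the optional groups after it match nothing. A name such as "file.period" with no real extension is still ambiguous under this sketch, since the last dotted part is treated as the extension.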
Upvotes: 0
Reputation: 79318
I'm still not sure whether you need the result as a single string. If so, something like this should work:
gsub("[/().]| - "," ",sub(".*?([^/]+/[^/]+$)","\\1",files_list))
[1] "Sub-subfolder file txt"
[2] "Sub-subfolder file Copy txt"
[3] "Sub-subfolder file 1 txt"
[4] "Sub-subfolder file Copy 2 txt"
[5] "Subfolder1 fileB txt"
[6] "Folder file C txt"
[7] "Folder file-D txt"
[8] "Folder file"
If you just need one pattern then:
pattern="[^/]+(?=/[^/]+$)|\\w+(?=[ ).-])|\\w+$"
regmatches(files_list,gregexpr(pattern,files_list,perl = TRUE))
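As a rough sketch of what this returns: regmatches() with gregexpr() gives a list holding one character vector of tokens per path, which can be pasted back together if a single string is wanted. The names files, tokens and joined are illustrative, and files is a two-element sample rather than the full files_list.

```r
files <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
  "C:/User/Name/Folder/file-D.txt"
)
pattern <- "[^/]+(?=/[^/]+$)|\\w+(?=[ ).-])|\\w+$"

# One character vector of tokens per path.
tokens <- regmatches(files, gregexpr(pattern, files, perl = TRUE))

# Collapse each token vector into a single space-separated string.
joined <- vapply(tokens, paste, character(1), collapse = " ")
print(joined)
```

Note that this pattern treats dashes and periods inside the file name as separators, so "file-D" comes back as the two tokens "file" and "D".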
Upvotes: 1
Reputation: 23808
This might help:
library(tools)
as.data.frame(cbind(dirname(files_list), file_path_sans_ext(basename(files_list)), file_ext(files_list)))
#                                            V1              V2  V3
#1 C:/User/Name/Folder/Subfolder1/Sub-subfolder            file txt
#2 C:/User/Name/Folder/Subfolder1/Sub-subfolder     file - Copy txt
#3 C:/User/Name/Folder/Subfolder1/Sub-subfolder        file (1) txt
#4 C:/User/Name/Folder/Subfolder1/Sub-subfolder file - Copy (2) txt
#5               C:/User/Name/Folder/Subfolder1           fileB txt
#6                          C:/User/Name/Folder          file.C txt
#7                          C:/User/Name/Folder          file-D txt
#8                          C:/User/Name/Folder            file
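Building on the same functions, here is a hedged sketch of how the pieces could be pushed a step further toward the data frame described in the question, with the immediate parent folder, the "Copy" marker and the number in their own columns. The column names, the helper variables stem and has_num, and the two small regexes are assumptions made for illustration; files is a shortened sample of files_list.

```r
library(tools)  # file_path_sans_ext(), file_ext()

files <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file (1).txt",
  "C:/User/Name/Folder/file.C.txt",
  "C:/User/Name/Folder/file"
)

stem    <- file_path_sans_ext(basename(files))  # file name without extension
has_num <- grepl(" \\([0-9]+\\)$", stem)        # does the stem end in " (n)"?

parts <- data.frame(
  folder = basename(dirname(files)),            # immediate parent folder only
  # strip an optional " - Copy" and/or " (n)" from the end of the stem
  name   = sub("( - Copy)?( \\([0-9]+\\))?$", "", stem, perl = TRUE),
  copy   = grepl(" - Copy( \\([0-9]+\\))?$", stem),
  number = ifelse(has_num, sub("^.* \\(([0-9]+)\\)$", "\\1", stem), NA_character_),
  ext    = file_ext(files),
  stringsAsFactors = FALSE
)
print(parts)
```

Letting dirname(), basename() and file_ext() handle the path and extension keeps the remaining regexes short, since they only have to deal with the suffixes at the end of the stem.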
Upvotes: 2