Josh

Reputation: 1337

Extracting folder, filename and extension from path

I have a character vector of file paths and names formatted as follows:

files_list <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file.txt", 
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy.txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file (1).txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
  "C:/User/Name/Folder/Subfolder1/fileB.txt",
  "C:/User/Name/Folder/file.C.txt",
  "C:/User/Name/Folder/file-D.txt", 
  "C:/User/Name/Folder/file",
  "C:/User/Name/Folder/file Z.txt", 
  "C:/User/Name/Folder/file - backup.txt"
)

Every file has a parent folder and a name. These names may include one or more periods "." and/or dashes "-". In addition, some have a "Copy" notation, number designation, and/or file extension. I want to convert the data to something that looks like this:

[1]  "Sub-subfolder   file   txt"
[2]  "Sub-subfolder   file   Copy   txt"
[3]  "Sub-subfolder   file   1   txt"
[4]  "Sub-subfolder   file   Copy   2   txt"
[5]  "Subfolder1   fileB   txt"
[6]  "Folder   file.C   txt"
[7]  "Folder   file-D   txt"
[8]  "Folder   file"
[9]  "Folder   file Z   txt"
[10] "Folder   file - backup   txt"

This is the code that I think should do the trick:

sub(
  "(^.:/)([^/.]+/)*([^/.]+/)([^/]+)(\\s-\\sCopy)?(\\s\\(([0-9]+)\\))?(\\.([^.]+))?$", 
  "\\3   \\4   \\5   \\7   \\9",
  files_list
)

But what I get is this:

[1] "Sub-subfolder/   file.txt         "           
[2] "Sub-subfolder/   file - Copy.txt         "    
[3] "Sub-subfolder/   file (1).txt         "       
[4] "Sub-subfolder/   file - Copy (2).txt         "
[5] "Subfolder1/   fileB.txt         "             
[6] "Folder/   file.C.txt         "                
[7] "Folder/   file-D.txt         "                

The slashes "/" and extra spaces I can deal with, but the "Copy" notations, number designations, and file extensions are not being set apart as I expect.

Any suggestions on how to identify the "Copy" notations, number designations, and file extensions? Or should I just identify the parent folders in one line of code and separate the rest in another line?

(Ultimately, I'm going to convert these text strings into a data frame with the folder, filename, copy designation, and extension as separate columns. I'm pretty sure I could do this with tidyr::separate, but even that requires an understanding of regex, and I want to learn how to use () and back-references.)

Upvotes: 0

Views: 538

Answers (3)

Josh

Reputation: 1337

Sorry if this isn't the best way to do this. I've realized that my question was incomplete, and I want to make the question more complete while also sharing the solution I came up with.

I want this code to deal with the full range of possible name structures:

  1. file in "C:/" or any other directory/subdirectory
  2. file name with any of the following characters/features
    • "." before the "." at the start of the file extension
    • "-" or " - " not part of " - Copy"
    • " " or "(" not part of " (number)" at the end of the file name

I used this code to generate example file names/paths that cover all folder/name/Copy/number/extension combinations:

files.df <- expand.grid(
    c("C:/"), 
    c("", "F1/", "F1/F2/"), 
    c("folder/"), 
    c("file"), 
    c("", " space", "-dash", " - spacedash", ".period", ".firstperiod.secondperiod"), 
    c("", 1, " 1", 10, " 10"), 
    c("", " - Copy"), 
    c("", " (1)", " (10)"), 
    c("", ".999", ".aaa"), 
    stringsAsFactors = FALSE
)

# Paste the columns of each row together to build one path per combination.
x <- do.call(paste0, files.df)

Through a lot of trial and error using regex101 (thanks, @Onyambu!), I put together the following ridiculous regex that actually works:

sum(grepl(
    "^.:/(([^/]+)(?=/)/?)*(?<=/)(([^/](?! - Copy| \\([0-9]+\\)|\\.[^/\\.]+$))+.)( - )?((?<= - )Copy(?= \\([0-9]+\\)(?=\\.[^/\\.]+$|$)|\\.[^/\\.]+$|$))?( \\()?((?<= \\()([0-9]+)\\)(?=\\.[^/\\.]+$|$))?\\.?((?<=\\.)([^/\\.]+))?$", 
    x,
    perl = TRUE
))
[1] 1620

length(x)
[1] 1620

Unfortunately, this regex includes 10 capturing groups, and I can only backreference 9 of them (and #10 is the file extension). So I'll be using @RHertel's much more elegant solution. But if anyone sees a way to reduce the number of capturing groups, let me know!
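
One possible way to cut down the number of groups, sketched here against files_list from the question rather than the full generated set. In the original attempt, the greedy ([^/]+) swallows the whole file name, so the optional groups after it always match nothing. Making that group lazy and wrapping the literal " - " and the parentheses in non-capturing (?:...) groups leaves five capturing groups, well within the nine that sub() can back-reference. The sketch uses perl = TRUE so that PCRE handles the lazy quantifier and the non-capturing groups. The helper name extract_group and the column names are made up for illustration, and the pattern has not been checked against all 1620 generated paths.

```r
files_list <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file.txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy.txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file (1).txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
  "C:/User/Name/Folder/Subfolder1/fileB.txt",
  "C:/User/Name/Folder/file.C.txt",
  "C:/User/Name/Folder/file-D.txt",
  "C:/User/Name/Folder/file",
  "C:/User/Name/Folder/file Z.txt",
  "C:/User/Name/Folder/file - backup.txt"
)

# Five capturing groups: folder, name, "Copy", number, extension.
pattern <- paste0(
  "^.*/([^/]+)/",          # 1: immediate parent folder
  "([^/]+?)",              # 2: file name, lazy so the optional parts can match
  "(?: - (Copy))?",        # 3: optional "Copy" marker
  "(?: \\(([0-9]+)\\))?",  # 4: optional number in parentheses
  "(?:\\.([^./]+))?$"      # 5: optional extension
)

# Hypothetical helper: return the text captured by group i for every path.
extract_group <- function(i) {
  sub(pattern, paste0("\\", i), files_list, perl = TRUE)
}

parts <- data.frame(
  folder = extract_group(1),
  name   = extract_group(2),
  copy   = extract_group(3),
  number = extract_group(4),
  ext    = extract_group(5),
  stringsAsFactors = FALSE
)
print(parts)
```

Groups that do not take part in a match come back as empty strings, so a file without an extension (such as "file") simply gets "" in the ext column.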

Upvotes: 0

Onyambu

Reputation: 79318

I'm still not sure whether you need the result as a single string, like below:

gsub("[/().]| - "," ",sub(".*?([^/]+/[^/]+$)","\\1",files_list))

[1] "Sub-subfolder file txt"         
[2] "Sub-subfolder file Copy txt"    
[3] "Sub-subfolder file  1  txt"     
[4] "Sub-subfolder file Copy  2  txt"
[5] "Subfolder1 fileB txt"           
[6] "Folder file C txt"              
[7] "Folder file-D txt"              
[8] "Folder file"  

If you just need one pattern then:

pattern="[^/]+(?=/[^/]+$)|\\w+(?=[ ).-])|\\w+$"
regmatches(files_list,gregexpr(pattern,files_list,perl = TRUE))
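
If one string per file is wanted after all, the list returned by regmatches can be collapsed again. A minimal sketch on three of the paths from the question (the names pieces and collapsed are made up here). Note that this pattern treats every period as a separator, so "file.C" comes back as two tokens.

```r
# A few of the paths from the question.
files_list <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file.txt",
  "C:/User/Name/Folder/file.C.txt",
  "C:/User/Name/Folder/file"
)

pattern <- "[^/]+(?=/[^/]+$)|\\w+(?=[ ).-])|\\w+$"

# One character vector of tokens per path.
pieces <- regmatches(files_list, gregexpr(pattern, files_list, perl = TRUE))

# Join each path's tokens with three spaces, as in the desired output.
collapsed <- vapply(pieces, paste, character(1), collapse = "   ")
print(collapsed)
```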


Upvotes: 1

RHertel

Reputation: 23808

This might help:

library(tools)
as.data.frame(cbind(dirname(files_list), file_path_sans_ext(basename(files_list)), file_ext(files_list)))
#                                            V1              V2  V3
#1 C:/User/Name/Folder/Subfolder1/Sub-subfolder            file txt
#2 C:/User/Name/Folder/Subfolder1/Sub-subfolder     file - Copy txt
#3 C:/User/Name/Folder/Subfolder1/Sub-subfolder        file (1) txt
#4 C:/User/Name/Folder/Subfolder1/Sub-subfolder file - Copy (2) txt
#5               C:/User/Name/Folder/Subfolder1           fileB txt
#6                          C:/User/Name/Folder          file.C txt
#7                          C:/User/Name/Folder          file-D txt
#8                          C:/User/Name/Folder            file    
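
If only the immediate parent folder is wanted, as in the question, wrapping dirname() in basename() is one way to get it. A sketch with named columns (the names parts, folder, name and ext are chosen here for illustration):

```r
library(tools)

files_list <- c(
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file.txt",
  "C:/User/Name/Folder/Subfolder1/Sub-subfolder/file - Copy (2).txt",
  "C:/User/Name/Folder/Subfolder1/fileB.txt",
  "C:/User/Name/Folder/file.C.txt",
  "C:/User/Name/Folder/file"
)

parts <- data.frame(
  folder = basename(dirname(files_list)),             # last directory only
  name   = file_path_sans_ext(basename(files_list)),  # file name without extension
  ext    = file_ext(files_list),                      # "" when there is no extension
  stringsAsFactors = FALSE
)
print(parts)
```

The " - Copy" and " (2)" parts stay inside the name column with this approach, so they would still need a separate step to split out.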

Upvotes: 2
