Catelinn Xiao

Reputation: 141

list.files stops looking after searching in the master folder and the first subfolder in R

Hello, I'm using RStudio 0.99.903 on 64-bit Windows. I am in the folder named "UCI HAR Dataset". If I list all the files in this folder and its subfolders using list.files(recursive = TRUE), all files are listed (screenshot: full list of .txt files).

However, I want to improve the code to list all .txt files except "feature_info" and "README", so I used list.files(recursive = TRUE, pattern = "[^\\<_info\\> | ^\\<README\\>].txt"). It removed the two files I don't want, but it also excluded those under the "/train" folder. Can anyone help clarify why it stops looking at the second subfolder?

Thanks!

Upvotes: 2

Views: 83

Answers (1)

Wiktor Stribiżew

Reputation: 627488

The [^\\<_info\\> | ^\\<README\\>] part matches one char that is not equal to <, _, i, n, f, o, >, space, |, ^, R, E, A, D or M, as [^...] is a negated bracket expression matching all chars other than those defined in the brackets. Then a . matches any char and txt matches txt as a literal char sequence.
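To see why the files under "/train" vanish, it may help to run the question's pattern against a few sample paths. This is only a sketch: the paths below are hypothetical names modelled on the UCI HAR layout, and it assumes grepl with its default settings behaves like the pattern argument of list.files. The char right before ".txt" decides the outcome, and the n in "_train" is one of the excluded chars, just like the o in "_info" and the E in "README".

```r
# The pattern from the question, used unchanged
pattern <- "[^\\<_info\\> | ^\\<README\\>].txt"

# Hypothetical paths modelled on the question's folder layout
paths <- c("features.txt", "README.txt", "features_info.txt",
           "test/X_test.txt", "train/X_train.txt")

# Character that sits right before ".txt" in each path
before_ext <- substr(paths, nchar(paths) - 4, nchar(paths) - 4)

# Assumption: grepl's default regex flavour matches what list.files uses
matched <- grepl(pattern, paths)

# One row per path: the deciding char and whether the pattern matched
data.frame(path = paths, before_ext = before_ext, matched = matched)
```

So list.files does search every subfolder; the pattern simply rejects any name whose last char before ".txt" appears inside the brackets.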

Since you cannot use PCRE regex with list.files, you may get all the files from the specified directory first, and then filter it out with grep that supports PCRE regex with lookarounds that you need here:

> files <- list.files("C:\\5")
> files
[1] "info.txt"      "README.txt"    "some-text.txt"
> files <- grep("(?<!^README|^info)\\.txt$", files, perl = TRUE, value = TRUE)
> files
[1] "some-text.txt"

Note that

  • (?<!^README|^info) - a negative lookbehind that fails the match if README or info is at the start of the string and located immediately to the left of the current location (that is right before...)
  • \\. - a single dot (the pattern is \. but we need to double backslashes in the string literals to denote a literal backslash)
  • txt - a literal char sequence
  • $ - end of string.
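For the recursive case in the question, list.files(recursive = TRUE) returns relative paths such as "train/...", so a start-of-string anchor would not line up with the file name itself. One possible alternative, shown here as a rough sketch rather than the only way to do it, is to list every .txt file first and then drop the unwanted ones by comparing basename() against a vector of names. The directory tree and file names below are made up for the demo (built under tempdir()); adjust the unwanted vector to the real names.

```r
# Build a throwaway directory tree that loosely mimics the question's layout
# (all names here are assumptions for illustration only)
root <- file.path(tempdir(), "har_demo")
dir.create(file.path(root, "train"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "test"), showWarnings = FALSE)
file.create(file.path(root, c("README.txt", "features_info.txt", "features.txt",
                              "train/X_train.txt", "test/X_test.txt")))

# Step 1: collect every .txt file, including those in subfolders
all_txt <- list.files(root, pattern = "\\.txt$", recursive = TRUE)

# Step 2: drop unwanted files by name, ignoring the folder part of each path
unwanted <- c("README.txt", "features_info.txt")
kept <- all_txt[!basename(all_txt) %in% unwanted]

kept
```

Comparing on basename() keeps the filter independent of how deep a file sits in the tree, and it avoids the need for lookarounds altogether.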

Upvotes: 1
