squirrat

Reputation: 405

Use grepl() to match multiple patterns in R

This command works to subset the data filelist to remove all "jpg" files.

filetype.isnotjpg <- setdiff(filelist, subset(filelist, grepl("\\.jpg$", filelist)))

So this takes the character vector "filelist", which contains the names of files from a directory. I want to return all files that are not of type "jpg", "doc", "pdf", "xls", etc., and I want to be able to specify as many types as I like to filter the list.

Ideally something like

target.files <- setdiff(filelist, subset(filelist, grepl(
    c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), filelist)))

Filtering one pattern at a time does what I want, but it is repetitive:

a <- setdiff(files.list, subset(files.list, grepl("\\.tmp", files.list, ignore.case = TRUE)))

a <- setdiff(a, subset(a, grepl("\\.jpg", a, ignore.case = TRUE)))
a <- setdiff(a, subset(a, grepl("\\.pdf", a, ignore.case = TRUE)))
a <- setdiff(a, subset(a, grepl("\\.tif", a, ignore.case = TRUE)))

etc. Would something like apply() work? I'm new to R, sorry.

The solution from 42 works:

      target.files <- setdiff(
        files.list, 
        subset(files.list, 
               grepl( 
                 paste(
                   c("\\.jpg", "\\.doc", "\\.pdf", 
                     "\\.xls", "\\.tif", "\\.docx", "\\.xlsx", "\\.jpeg"), 
                   collapse="|") , 
                 files.list, 
                 ignore.case = TRUE)))

Upvotes: 1

Views: 2092

Answers (2)

James

Reputation: 66834

You can use file_ext() from the tools package to extract the extension from a filename. Then you can check whether each extension is in your list and use standard vector subsetting:

filelist[!(tools::file_ext(filelist) %in% c("jpg","jpeg","doc","pdf","xls"))]

If you need to ignore case, you can wrap tolower() around either the file list or the extracted extensions.
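
As a minimal sketch of the case-insensitive variant, assuming a made-up filelist vector (the file names below are illustrative and not from the question):

```r
# Hypothetical example data; not from the original question.
filelist <- c("report.PDF", "photo.jpg", "notes.txt", "data.XLS", "script.R")

# Extensions to exclude, written in lower case.
excluded <- c("jpg", "jpeg", "doc", "pdf", "xls")

# Lower-case the extracted extensions so that "PDF" and "pdf" compare equal.
kept <- filelist[!(tolower(tools::file_ext(filelist)) %in% excluded)]
kept
```

Because the comparison uses the whole extension rather than a regular expression, "xls" here does not also exclude "xlsx"; each extension has to be listed explicitly.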

Upvotes: 1

IRTFM

Reputation: 263352

I would try paste()-ing with a collapsing separator of "|" which is the OR operator for regex:

target.files <- setdiff(filelist, subset(filelist, grepl( paste(
c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), collapse="|") , filelist)))
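
The collapse step matters because grepl() uses only the first element when it is given a vector of patterns. As a rough sketch, the same filter can also be written with plain logical indexing instead of setdiff() and subset(); the file names here are hypothetical:

```r
# Hypothetical file names for illustration.
filelist <- c("a.jpg", "b.doc", "c.txt", "d.xls", "e.pdf", "xls_notes.md")

# Collapse the individual patterns into a single regular expression.
pattern <- paste(c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), collapse = "|")

# grepl() returns one logical per file name; negate it to keep the non-matches.
target.files <- filelist[!grepl(pattern, filelist)]
target.files
```

The "$" anchors and the escaped dots keep a name such as "xls_notes.md" from being dropped just because it contains "xls".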

Did you know that the list.files function also accepts a pattern argument? You could do the matching in a single step with something like:

 my_files <- list.files(path="/path/to/dir/", 
                        pattern=paste( c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), 
                                       collapse="|") )
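
Note that the pattern argument of list.files keeps the files that match, whereas the question asks for the files that do not. One possible way to get the complement, sketched here against a throwaway temporary directory with invented file names, is to list everything and invert the match with grep():

```r
# Build a throwaway directory with a few empty files (names are illustrative only).
dir_path <- tempfile("demo_dir")
dir.create(dir_path)
file.create(file.path(dir_path, c("a.jpg", "b.doc", "c.txt", "d.XLS", "e.csv")))

# One alternation pattern built from the extensions to exclude.
pattern <- paste(c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), collapse = "|")

# List every file, then keep only the names that do NOT match the pattern.
all_files <- list.files(path = dir_path)
other_files <- grep(pattern, all_files, ignore.case = TRUE, invert = TRUE, value = TRUE)
other_files
```

Here value = TRUE makes grep() return the names rather than their positions, and invert = TRUE selects the non-matching ones.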

Upvotes: 3
