userk

Reputation: 961

R: extract part of a filename

I'm trying to extract part of a filename using R. I have a vague idea of how to go about this from the question "extract part of a file name in R", but I can't quite get it to work on my list of filenames.

Example filenames:

"Species Count (2011-12-15-07-09-39).xls"
"Species Count 0511.xls"
"Species Count 151112.xls" 
"Species Count1011.xls" 
"Species Count2012-01.xls" 
"Species Count201207.xls" 
"Species Count2013-01-15.xls"  

Some of the filenames have a space between "Species Count" and the date and some don't; they are of different lengths, and some contain brackets. I just want to extract the numerical part of each filename and keep the hyphens as well. So for the data above I would have:

Expected output:

2011-12-15-07-09-39 , 0511 , 151112 , 1011 , 2012-01 , 201207 , 2013-01-15

Upvotes: 7

Views: 15902

Answers (4)

Joshua Ulrich

Reputation: 176718

If you're concerned about speed, you can use sub with back-references to extract the portions you want. Also note that perl=TRUE is often faster (according to ?grep).

library(microbenchmark)

# tt is the file-name vector defined in Arun's answer
jj <- function() sub("[^0-9]*([0-9].*[0-9])[^0-9]*", "\\1", tt, perl=TRUE)
aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt, perl=TRUE))

# Run on R-2.15.2 on 32-bit Windows
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: milliseconds
#           expr       min        lq    median        uq       max
# 1 arun <- aa() 2156.5024 2189.5168 2191.9972 2195.4176 2410.3255
# 2 josh <- jj()  390.0142  390.8956  391.6431  394.5439  493.2545
identical(arun, josh)  # TRUE

# Run on R-3.0.1 on 64-bit Ubuntu
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  arun <- aa() 1.794522 1.839044 1.858556 1.894946 2.207016    25
#  josh <- jj() 1.003365 1.008424 1.009742 1.059129 1.074057    25
identical(arun, josh)  # still TRUE
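To see what the back-reference does without running the benchmark, here is a minimal sketch on two of the file names from the question. The vector name `files` and the result name `extracted` are made up for illustration; the pattern is the one used in jj above.

```r
# Illustrative input; `files` is a hypothetical name, values are from the question
files <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls")

# [^0-9]*          consumes the leading non-digits
# ([0-9].*[0-9])   captures from the first digit to the last digit (group 1)
# [^0-9]*          consumes the trailing non-digits
# "\\1"            replaces the whole match with the captured group
extracted <- sub("[^0-9]*([0-9].*[0-9])[^0-9]*", "\\1", files, perl = TRUE)
extracted
# [1] "2011-12-15-07-09-39" "0511"
```

One difference worth keeping in mind: sub returns a string unchanged when the pattern does not match, whereas regmatches combined with regexpr drops non-matching elements, so the two approaches only agree when every file name contains a match.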

Upvotes: 2

Arun

Reputation: 118879

Here's one way:

regmatches(tt, regexpr("[0-9].*[0-9]", tt))

I assume that there are no other numbers in your file names. So, we search for the first digit and use the greedy .* so that everything up to the last digit is captured. regexpr returns the positions of the matches, and regmatches then extracts the (sub)strings at those positions.
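The two steps can be looked at separately. The following is a small sketch on two of the file names from the question; the names `files`, `m` and `dates` are only for illustration.

```r
# Illustrative input; `files` is a hypothetical name, values are from the question
files <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count1011.xls")

# Step 1: locate the match (first digit through last digit) in each string.
# The result holds the start positions; the lengths are stored as an attribute.
m <- regexpr("[0-9].*[0-9]", files)
as.integer(m)            # start position of each match
attr(m, "match.length")  # length of each match

# Step 2: cut the matched substrings out of the original strings
dates <- regmatches(files, m)
dates
# [1] "2011-12-15-07-09-39" "1011"
```

Because .* is greedy, a file name with a second group of digits later on would be captured from the first digit to the very last one, which is why the assumption about no other numbers matters.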


where tt is:

tt <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls", 
        "Species Count 151112.xls", "Species Count1011.xls", 
        "Species Count2012-01.xls", "Species Count201207.xls", 
        "Species Count2013-01-15.xls")

Benchmarking:

Note: Benchmarking results may differ between Windows and *nix machines (as @Hansi notes below under comments).

There are quite a few nice answers here, so it's time for benchmarking :)

tt <- rep(tt, 1e5) # tt is from above

require(microbenchmark)
require(stringr)
aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt))
bb <- function() gsub("[A-z \\.\\(\\)]", "", tt)
cc <- function() str_extract(tt,'([0-9]|[0-9][-])+')

microbenchmark(arun <- aa(), agstudy <- cc(), Jean <- bb(), times=25)
Unit: seconds
            expr      min       lq   median       uq       max neval
    arun <- aa() 1.951362 2.064055 2.198644 2.397724  3.236296    25
 agstudy <- cc() 2.489993 2.685285 2.991796 3.198133  3.762166    25
    Jean <- bb() 7.824638 8.026595 9.145490 9.788539 10.926665    25

identical(arun, agstudy) # TRUE
identical(arun, Jean) # TRUE

Upvotes: 8

agstudy

Reputation: 121608

Use the stringr package to extract the first run of digits, where each digit may be followed by a - (here ll is assumed to be the character vector of file names):

library(stringr)
str_extract(ll,'([0-9]|[0-9][-])+')

[1] "2011-12-15-07-09-39" "0511"                "151112"              "1011"                "2012-01"            
[6] "201207"              "2013-01-15"         

Upvotes: 1

Jean V. Adams

Reputation: 4784

Use the function gsub() to remove all of the letters, spaces, periods, and parentheses. Then you will be left with numbers and hyphens. For example,

x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls", 
    "Species Count 151112.xls", "Species Count1011.xls", "Species Count2012-01.xls", 
    "Species Count201207.xls", "Species Count2013-01-15.xls")

# A-Za-z rather than A-z: the range A-z also covers the punctuation between Z and a
gsub("[A-Za-z \\.\\(\\)]", "", x)

[1] "2011-12-15-07-09-39" "0511"                "151112"             
[4] "1011"                "2012-01"             "201207"             
[7] "2013-01-15"         
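A possible variation, not part of the original answer: instead of listing every character to remove, a negated character class removes everything that is not a digit or a hyphen. This is only a sketch, and it assumes the file names contain no hyphens or digits outside the date part. The name `cleaned` is made up for illustration.

```r
# Two of the file names from the question
x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count2012-01.xls")

# [^0-9-] matches any character that is neither a digit nor a hyphen;
# the hyphen is placed last in the class so it is taken literally
cleaned <- gsub("[^0-9-]", "", x)
cleaned
# [1] "2011-12-15-07-09-39" "2012-01"
```

This avoids having to anticipate every unwanted character (underscores, commas and so on) in future file names.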

Upvotes: 5
