Reputation: 3039
I have a list of substrings with the following pattern:
my.list <- list("file1\\subfile1-D.ext", "file12\\subfile9-D.ext", "file2\\subfile113-D.ext")
and so on. I'd like to extract the file numbers and the subfile-numbers into a numeric data frame containing the file/subfile numbers. So far, I've been using the following approach:
extract.file <- function(file.name){
file.name <- sub("file", "", file.name)
file.name <- sub("\\\\*subfile.*", "", file.name)
}
extract.subfile <- function(subfile.name){
subfile.name <- sub("file.*subfile", "", subfile.name)
subfile.name <- sub("-D.ext", "", subfile.name)
}
name.file <- lapply(my.list, extract.file)
name.file <- as.numeric(unlist(name.file))
name.subfile <- lapply(my.list, extract.subfile)
name.subfile <- as.numeric(unlist(name.subfile))
my.df <- data.frame(file=name.file, subfile=name.subfile)
I've also played around with first extracting the string locations with substring.location from the stringr library (which yields another list with start and end values), and then looping over the two lists, but this gets too complicated again. Is there a better way to achieve the goal?
Upvotes: 1
Views: 697
Reputation: 5958
Some alternatives:
[Edit: strsplit can take a vector and return a list, which cuts the run time roughly in half compared to nesting an apply within the rbind call.]
my.df <- do.call( rbind, strsplit( unlist(my.list), split="(\\\\|-D.ext)" ) )
my.df <- data.frame( my.df )
names( my.df ) <- c("file", "subfile")
or
my.df <- do.call( rbind, strsplit( unlist(my.list), split="[^[:alnum:]]" ) )[, 1:2]
my.df <- data.frame( my.df )
names( my.df ) <- c("file", "subfile")
One drawback of doing things this way is that the columns still hold text such as "file1" and "subfile1": if all of the input follows the original my.list sample, the "file" and "subfile" prefixes are redundant and the values are not yet numbers.
A better solution might be:
# strsplit() returns a leading empty string because every input starts with a
# match ("file"); see the note in ?strsplit. We drop that first column.
my.list <- unlist( my.list )
my.df <- do.call( rbind, strsplit( my.list, split="[^[:digit:]]+" ) )[,-1]
storage.mode( my.df ) <- "double"   # turn the character matrix into numbers
my.df <- data.frame( my.list, my.df )
names( my.df ) <- c( "orig", "file", "subfile" )
We've saved quite a bit of memory/storage without all of that duplication and we gain the ability to manipulate things without fussing with text/character ordering/representation.
Check ?strsplit, ?regex, and ?grep for the matching stuff.
The data.frame setup is pretty straightforward: strsplit takes a vector and returns a list, while do.call requires a list to bind together.
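To make those shapes concrete, here is a small sketch of what each step returns for the sample strings from the question. The names x, parts and mat are only illustrative and do not appear in the code above.

```r
# Sample input from the question, flattened to a character vector.
x <- c("file1\\subfile1-D.ext", "file12\\subfile9-D.ext", "file2\\subfile113-D.ext")

# strsplit() is vectorised: it returns a list with one element per input string.
parts <- strsplit(x, split = "[^[:digit:]]+")
str(parts)

# do.call() hands the list elements to rbind() as separate arguments,
# so each element becomes one row of a character matrix.
mat <- do.call(rbind, parts)
dim(mat)
```

The first column of mat holds the leading empty strings, which is why the answer drops it with [,-1].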
Upvotes: 5
Reputation: 179418
Here is a regex with backreferences that seems to do what you ask for:
sapply(my.list, function(x)gsub(".*\\\\(.*)-D\\.ext", "\\1", x))
[1] "subfile1" "subfile9" "subfile113"
The "\\1" is a backreference that returns the value of the string inside the parentheses.
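If the goal is the numeric data frame from the question, the same backreference idea can probably be extended with two capture groups, one per number. This is only a sketch, assuming every entry follows the fileN\subfileM-D.ext pattern of the sample; the names x and pattern are illustrative.

```r
# Sample input from the question, flattened to a character vector.
x <- unlist(list("file1\\subfile1-D.ext", "file12\\subfile9-D.ext", "file2\\subfile113-D.ext"))

# Two capture groups: the digits after "file" and the digits after "subfile".
# "\\\\" in the R string is a regex-escaped literal backslash.
pattern <- "^file([0-9]+)\\\\subfile([0-9]+)-D\\.ext$"

my.df <- data.frame(
  file    = as.numeric(sub(pattern, "\\1", x)),  # keep only the first group
  subfile = as.numeric(sub(pattern, "\\2", x))   # keep only the second group
)
my.df
```

Entries that do not match the pattern are returned unchanged by sub(), so as.numeric() would turn them into NA with a warning rather than failing silently.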
Upvotes: 2