Reputation: 121
I have a large list of file names that I need to extract information from using R. The info is delimited by multiple dashes and underscores. I am having trouble figuring out a method that will accommodate the fact that the number of characters between delimiters is not consistent (the order of the information will remain constant, as will the delimiters used (hopefully)).
For example:
f <- data.frame(c("EI-SM4-AMW11_20160614_082800.wav", "PA-RF-A50_20160614_082800.wav"), stringsAsFactors = FALSE)
colnames(f)<-"filename"
f$area <- str_sub(f$filename, 1, 2)
f$rec <- str_sub(f$filename, 4, 6)
f$site <- str_sub(f$filename, 8, 12)
This produces correct results for the first file, but incorrect results for the second.
I've tried using the "stringr" and "stringi" packages, and know that hard coding the values in doesn't work, so I've come up with awkward solutions using both packages such as:
f$site <- str_sub(f$filename,
stri_locate_last(f$filename, fixed="-")[,1]+1,
stri_locate_first(f$filename, fixed="_")[,1]-1)
I feel like there must be a more elegant (and robust) method, perhaps involving regex (which I am painfully new to).
I've looked at other examples (Extract part of string (till the first semicolon) in R, R: Find the last dot in a string, Split string using regular expressions and store it into data frame).
Any suggestions/pointers would be very much appreciated.
Upvotes: 1
Views: 1704
Reputation: 1751
Something like this:
library(stringr)
library(dplyr)
f$area <- word(f$filename, 1, sep = "-")
f$rec <- word(f$filename, 2, sep = "-")
f$site <- word(f$filename, 3, sep = "-") %>%
word(1,sep = "_")
dplyr
is not necessary but makes concatenation cleaner.
The function word
belongs to stringr
.
Upvotes: 0
Reputation: 10203
Try this, from the `tidyr' package:
library(tidyr)
f %>% separate(filename, c('area', 'rec', 'site'), sep = '-')
You can also split along multiple difference delimeters, like so:
f %>% separate(filename, c('area', 'rec', 'site', 'date', 'don_know_what_this_is', 'file_extension'), sep = '-|_|\\.')
and then keep only the columns you want using dplyr
's select
function:
library(dplyr)
library(tidyr)
f %>%
separate(filename,
c('area', 'rec', 'site', 'date',
'don_know_what_this_is', 'file_extension'),
sep = '-|_|\\.') %>%
select(area, rec, site)
Upvotes: 2