JMDR

Reputation: 121

Split parts of string defined by multiple delimiters into multiple variables in R

I have a large list of file names that I need to extract information from using R. The information is delimited by dashes and underscores. I am having trouble finding a method that copes with the fact that the number of characters between delimiters is not consistent. The order of the fields will remain constant, as (hopefully) will the delimiters used.

For example:

 library(stringr)

 f <- data.frame(c("EI-SM4-AMW11_20160614_082800.wav", "PA-RF-A50_20160614_082800.wav"), stringsAsFactors = FALSE)
 colnames(f) <- "filename"
 # Fixed character positions: only correct when every field has the same width
 f$area <- str_sub(f$filename, 1, 2)
 f$rec <- str_sub(f$filename, 4, 6)
 f$site <- str_sub(f$filename, 8, 12)

This produces correct results for the first file, but incorrect results for the second, where rec comes out as "RF-" and site as "50_20".

Hard-coding the positions clearly doesn't work, so I've come up with awkward solutions that combine the "stringr" and "stringi" packages, such as:

f$site <- str_sub(f$filename, 
                  stri_locate_last(f$filename, fixed="-")[,1]+1, 
                  stri_locate_first(f$filename, fixed="_")[,1]-1)

I feel like there must be a more elegant (and robust) method, perhaps involving regex (which I am painfully new to).

I've looked at other examples ("Extract part of string (till the first semicolon) in R", "R: Find the last dot in a string", "Split string using regular expressions and store it into data frame").

Any suggestions/pointers would be very much appreciated.

Upvotes: 1

Views: 1704

Answers (2)

thepule

Reputation: 1751

Something like this:

library(stringr)
library(dplyr)

f$area <- word(f$filename, 1, sep = "-")
f$rec <- word(f$filename, 2, sep = "-")
f$site <- word(f$filename, 3, sep = "-") %>%
  word(1, sep = "_")

dplyr is not required; it is loaded only for the %>% pipe, which makes chaining the two word() calls cleaner. The word() function comes from stringr.
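
If you would rather not load any packages, the same splitting idea can be sketched in base R. This is only a minimal sketch: it assumes every file name contains exactly the same six fields, and `parts` is just an illustrative name, not something from the question.

```r
f <- data.frame(
  filename = c("EI-SM4-AMW11_20160614_082800.wav",
               "PA-RF-A50_20160614_082800.wav"),
  stringsAsFactors = FALSE
)

# Split each name on any of "-", "_" or "." (a regex character class).
# Assumption: every name yields the same number of pieces, so the
# list can be bound row-wise into a character matrix.
parts <- do.call(rbind, strsplit(f$filename, "[-_.]"))

f$area <- parts[, 1]
f$rec  <- parts[, 2]
f$site <- parts[, 3]

f$site
# [1] "AMW11" "A50"
```

If some names have a different number of fields, the rbind step will recycle values with a warning, so it is worth checking `lengths(strsplit(f$filename, "[-_.]"))` first.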

Upvotes: 0

RoyalTS

Reputation: 10203

Try separate() from the tidyr package:

library(tidyr)

f %>% separate(filename, c('area', 'rec', 'site'), sep = '-')

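Note that with sep = '-' alone, the site column still contains the trailing date, time and extension. You can also split on several different delimiters at once, like so: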

f %>% separate(filename, c('area', 'rec', 'site', 'date', 'don_know_what_this_is', 'file_extension'), sep = '-|_|\\.')

and then keep only the columns you want using dplyr's select function:

 library(dplyr)
 library(tidyr)

 f %>% 
   separate(filename,
            c('area', 'rec', 'site', 'date',
              'don_know_what_this_is', 'file_extension'), 
            sep = '-|_|\\.') %>%
   select(area, rec, site)
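
Since the question asks about regex specifically, here is a rough base R sketch that uses capture groups instead of splitting. It assumes every file name starts with the area-rec-site_ layout shown in the question; `pattern` and `caps` are illustrative names only.

```r
f <- data.frame(
  filename = c("EI-SM4-AMW11_20160614_082800.wav",
               "PA-RF-A50_20160614_082800.wav"),
  stringsAsFactors = FALSE
)

# Three capture groups: text up to the first "-", text up to the
# second "-", then text up to the first "_".
pattern <- "^([^-]+)-([^-]+)-([^_]+)_"

# regexec() + regmatches() return, per file name, the full match
# followed by the three captured groups.
# Assumption: every name matches; a non-matching name would give an
# empty vector and break the row-binding below.
caps <- do.call(rbind, regmatches(f$filename, regexec(pattern, f$filename)))

f$area <- caps[, 2]
f$rec  <- caps[, 3]
f$site <- caps[, 4]

f[, c("area", "rec", "site")]
```

Column 1 of `caps` holds the whole matched prefix, which is why the fields start at column 2.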

Upvotes: 2
