Pharny

Reputation: 17

How to extract text from a column using R

How would I go about extracting only part of a string from a specific column, for each row (there are ~56,000 records in an Excel file)? I need to keep all text to the left of the last forward slash '/'. The challenge is that not all cells have the same number of '/'. There is always a filename (*.wav) after the last '/', but the number of characters in the filename is not always the same (sometimes 5 and sometimes 6).

Below are some examples of the strings in the cells:
cloch/51.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav
AB_AeolinaL/025-C#.wav
AB_AeolinaL/026-D.wav
AB_violadamourL/rel99999/091-G.wav
AB_violadamourL/rel99999/092-G#.wav
AB_violadamourR/024-C.wav
AB_violadamourR/025-C#.wav

The extracted text should be:
cloch
grand/Grand_bombarde/02-suchy_Grand_bombarde
grand/Grand_bombarde/02-suchy_Grand_bombarde
AB_AeolinaL
AB_AeolinaL
AB_violadamourL/rel99999
AB_violadamourL/rel99999
AB_violadamourR
AB_violadamourR

Can anyone recommend a strategy using R?

Upvotes: 0

Views: 1713

Answers (5)

The fourth bird

Reputation: 163197

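You could start the match with a / followed by one or more characters that are neither a forward slash nor whitespace, using the negated character class [^\\s/]+

Then match .wav at the end of the string using $

Replace the match with an empty string, for example with sub. In R, pass perl = TRUE for this pattern: the default regex engine (TRE) does not treat \\s as a shorthand inside a bracket expression, so without it the class would exclude a literal backslash and the letter s rather than whitespace. Alternatively, write the class as [^[:space:]/]+

/[^\\s/]+\\.wav$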


strings <- c("cloch/51.wav",
             "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
             "grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
             "AB_AeolinaL/025-C#.wav",
             "AB_AeolinaL/026-D.wav",
             "AB_violadamourL/rel99999/091-G.wav",
             "AB_violadamourL/rel99999/092-G#.wav",
             "AB_violadamourR/024-C.wav",
             "AB_violadamourR/025-C#.wav")

sub("/[^\\s/]+\\.wav$", "", strings, perl = TRUE)

Output

[1] "cloch"                                       
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"                                 
[5] "AB_AeolinaL"                                 
[6] "AB_violadamourL/rel99999"                    
[7] "AB_violadamourL/rel99999"                    
[8] "AB_violadamourR"                             
[9] "AB_violadamourR"
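
If the files are not guaranteed to end in .wav, a more general variant is to drop the last "/" and everything after it. This is only a sketch; the last two strings are made-up examples added to show the edge cases, not part of the original data.

```r
paths <- c("cloch/51.wav",
           "AB_violadamourL/rel99999/092-G#.wav",
           "notes/readme.txt",   # hypothetical: not a .wav file
           "no_slash.wav")       # hypothetical: no directory part

# Remove the last "/" and everything that follows it.
# A string without any "/" does not match and is returned unchanged.
dirs <- sub("/[^/]*$", "", paths)
print(dirs)
```

Because the pattern contains no shorthand classes, it behaves the same with or without perl = TRUE.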

Upvotes: 0

user12728748

Reputation: 8506

You could use

dirname(strings)

If a string contains no /, dirname returns ".", which you could replace afterwards if you like, e.g.:

res <- dirname(strings)
res[res=="."] <- ""
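
Since the data sit in a column, the same idea can be applied to a data frame directly. A minimal sketch, assuming a data frame df with a character column named path (both names are placeholders for whatever the Excel import produces):

```r
# Hypothetical data frame standing in for the imported Excel sheet
df <- data.frame(
  path = c("cloch/51.wav",
           "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
           "solo.wav"),            # made-up entry with no "/"
  stringsAsFactors = FALSE
)

# dirname() is vectorised, so it handles the whole column at once
df$folder <- dirname(df$path)

# dirname() returns "." when there is no directory part; blank those out
df$folder[df$folder == "."] <- ""
print(df$folder)
```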

Upvotes: 0

itsDV7

Reputation: 854

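You can use str_remove(string, pattern) from the stringr package. The pattern below assumes the filename starts with digits, optionally followed by a hyphen, capital letters and a #, as in your examples:

library(stringr)

# use a name other than str, which is also a base R function
s <- "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav"
str_remove(s, "/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")

Output:

> str_remove(s, "/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "grand/Grand_bombarde/02-suchy_Grand_bombarde"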

str_remove is vectorised, so the same call works on the whole vector of strings without a loop:

strings <- c("cloch/51.wav",
             "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
             "grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
             "AB_AeolinaL/025-C#.wav",
             "AB_AeolinaL/026-D.wav",
             "AB_violadamourL/rel99999/091-G.wav",
             "AB_violadamourL/rel99999/092-G#.wav",
             "AB_violadamourR/024-C.wav",
             "AB_violadamourR/025-C#.wav")
str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")

Output:

> str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "cloch"                                       
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"                                 
[5] "AB_AeolinaL"                                 
[6] "AB_violadamourL/rel99999"                    
[7] "AB_violadamourL/rel99999"                    
[8] "AB_violadamourR"                             
[9] "AB_violadamourR"  
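
One caveat with the digit-based pattern above is that it leaves a string untouched when the filename does not start with a digit. A sketch of a more general pattern, assuming stringr is installed; the third string is a made-up example:

```r
library(stringr)

paths <- c("cloch/51.wav",
           "AB_AeolinaL/025-C#.wav",
           "grand/Grand_bombarde/intro.wav")  # hypothetical: no leading digits

# Digit-based pattern: the third string is returned unchanged
by_digits <- str_remove(paths, "/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")

# General pattern: remove the last "/" and whatever follows it
by_last_slash <- str_remove(paths, "/[^/]*$")

print(by_digits)
print(by_last_slash)
```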

Upvotes: 3

Leonardo

Reputation: 2485

Assuming that the strings are in a column of a data frame:

df <- data.frame(x = 1:5, y = c("cloch/51.wav", 
                                "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav", 
                                "grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav", 
                                "AB_AeolinaL/025-C#.wav", 
                                "AB_AeolinaL/026-D.wav"),
                 stringsAsFactors = FALSE)  # keep y as character (the default since R 4.0)

# Define a function that splits a string at each "/",
# drops the last piece, and rejoins the remaining pieces

cut_str <- function(s) {
  st <- head((unlist(strsplit(s, "\\/"))), -1)
  r <- paste(st, collapse = "/")
  return(r)
}

# Apply the function to every element of the column with sapply

new_strings <- as.vector(sapply(df$y, FUN = cut_str))
new_strings

[1] "cloch"                                       
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"                                 
[5] "AB_AeolinaL" 
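
The same split-drop-rejoin idea can also be written without a helper function, since strsplit already works on a whole vector. This is just a sketch of an alternative; the variable names are my own:

```r
paths <- c("cloch/51.wav",
           "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
           "AB_AeolinaL/025-C#.wav")

# Split every string at "/" in one call; fixed = TRUE treats "/" literally
pieces <- strsplit(paths, "/", fixed = TRUE)

# Drop the last piece of each element and glue the rest back together.
# vapply() guarantees a character vector with one value per input string.
folders <- vapply(pieces,
                  function(p) paste(head(p, -1), collapse = "/"),
                  character(1))
print(folders)
```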

Upvotes: 0

Terru_theTerror

Reputation: 5017

You can take the substring up to the last "/" using substr and regexpr:

substr(strings,1,regexpr("\\/[^\\/]*$", strings)-1)
[1] "cloch"                                       
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"                                 
[5] "AB_AeolinaL"                                 
[6] "AB_violadamourL/rel99999"                    
[7] "AB_violadamourL/rel99999"                    
[8] "AB_violadamourR"                             
[9] "AB_violadamourR"

Input

strings<-c("cloch/51.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav","AB_AeolinaL/025-C#.wav","AB_AeolinaL/026-D.wav","AB_violadamourL/rel99999/091-G.wav","AB_violadamourL/rel99999/092-G#.wav","AB_violadamourR/024-C.wav","AB_violadamourR/025-C#.wav")

Here, regexpr("\\/[^\\/]*$", strings) returns the position of the last "/" in each string, so the substring ends one character before it.
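
To see what regexpr contributes, it can help to look at the positions separately. A small sketch; the third string is a made-up example without a slash, for which regexpr returns -1:

```r
paths <- c("cloch/51.wav",
           "AB_violadamourL/rel99999/091-G.wav",
           "no_slash.wav")   # hypothetical: contains no "/"

# Position of the last "/" in each string, or -1 when there is none.
# as.integer() drops the extra attributes that regexpr() attaches.
pos <- as.integer(regexpr("/[^/]*$", paths))
print(pos)

# Keep everything before that position; make the no-slash case explicit
folders <- ifelse(pos > 0, substr(paths, 1, pos - 1), "")
print(folders)
```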

Upvotes: 0
