alexb523

Reputation: 728

R - Extract info after nth occurrence of a character from the right of string

I've seen many variations on extracting substrings with gsub, but they mostly extract from left to right or after a single occurrence. I want to work from right to left instead: count four occurrences of - from the end of the string and extract everything between the 3rd and 4th occurrence.

For example:

string                       outcome
here-are-some-words-to-try   some
a-b-c-d-e-f-g-h-i            f

Here are a few references I've tried using:

Upvotes: 12

Views: 19820

Answers (4)

Rob Smith

Reputation: 561

Another option could be to identify the position of the element(s) in the string. This solution is duplicated for a similar question here.

This is a little messy, but it achieved what I wanted and would solve your issue. I also like that I can modify it to suit a variety of situations. If I could get my head around regex, a pattern-based approach such as Jan's solution would undoubtedly be cleaner and more efficient.

The code below combines several functions. stringr::str_locate_all() returns a list with one matrix per input string. In each matrix, the first column holds the start position of each occurrence of the pattern and the second column holds the end position, so each row gives the start and end of one match.
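
To make that output shape concrete, here is a minimal sketch on a single short string. The input "a-b-c" and the variable names are only illustrative.

```r
library(stringr)

# One input string yields one matrix; each row is one match of the pattern.
locs <- str_locate_all("a-b-c", "-")
locs[[1]]
#      start end
# [1,]     2   2
# [2,]     4   4

# Start position of the second occurrence (row 2, column "start").
second_start <- locs[[1]][2, "start"]
```

Indexing a row that does not exist, for example the third occurrence here, would need a length check first, since the matrix only has as many rows as there are matches.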

Since I am working in a dataframe and want to use the specific index numbers, I find it easiest to extract the number related to the start of the pattern and save it as a variable in the dataframe.

purrr::map() then allows you to extract a particular value (in this case, the n-th occurrence). In my example I extract the start index (i.e. first column) of the second occurrence (i.e. second row) from each matrix with .x[,1][2]. This value then needs to be unlisted and stored as a numeric value.

stringr::str_length() then returns the integer length of the string (or, 'total number of characters').

After extracting the specific index values, you then need to extract a substring from position to position. Just remember that special characters need to be properly escaped.

Finally, stringr::str_sub() is used to extract everything between the n'th occurrence of the particular pattern and the last character in the string.

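library(data.table)
library(dplyr)
library(stringr)
library(purrr)

text_pattern <- "-"
df <- data.table(var_name = c("kj<hdf - fæld - adsk-jf -h af",
                              "kj<hds - sdaf - saflaf- adf",
                              "asdgya - oaid - aa-s--s a-",
                              "k<hdfk - lkja - ljad -"))

df <- df %>%
    # start position of the second occurrence of the pattern in each string
    mutate(second_dash = as.numeric(unlist(
        str_locate_all(pattern = text_pattern, var_name) %>%
            map(~ .x[, 1][2])
    ))) %>%
    # everything from two characters past that dash to the end of the string
    mutate(New_substring = str_sub(string = var_name,
                                   start = second_dash + 2,
                                   end = str_length(var_name)))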

#                         var_name second_dash New_substring
# 1: kj<hdf - fæld - adsk-jf -h af          15 adsk-jf -h af
# 2:   kj<hds - sdaf - saflaf- adf          15   saflaf- adf
# 3:    asdgya - oaid - aa-s--s a-          15    aa-s--s a-
# 4:        k<hdfk - lkja - ljad -          15        ljad -

For your particular case, continuing the use of a dash rather than the underscore, you could specify the index numbers (or occurrence numbers) with variables, n and m for example.

In the worked example below, I have added 2 to the start and deducted 2 from the end of the substring to remove the spaces. Note that the index points at the matched character itself, so if you do not want the dash "-" or underscore "_" included in your output, you need to add at least 1 to the start index and deduct at least 1 from the end index. The exact offsets depend on your specific purpose. This could also be achieved more intelligently by trimming the padding of spaces around the values, but I'm including the modifications to illustrate how the index values can be manipulated.

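# Uses data.table, dplyr, stringr and purrr
text_pattern <- "-"
n <- 2      # occurrence that opens the substring
m <- n + 1  # occurrence that closes the substring

df <- data.table(var_name = c("kj<hdf - fæld - adsk-jf -h af",
                              "kj<hds - sdaf - saflaf- adf",
                              "asdgya - oaid - aa-s--s a-",
                              "k<hdfk - lkja - ljad -"))

df <- df %>%
    # start position of the n-th occurrence
    mutate(n_dash = as.numeric(unlist(
        str_locate_all(pattern = text_pattern, var_name) %>%
            map(~ .x[, 1][n])
    ))) %>%
    # start position of the m-th occurrence
    mutate(m_dash = as.numeric(unlist(
        str_locate_all(pattern = text_pattern, var_name) %>%
            map(~ .x[, 1][m])
    ))) %>%
    # text between the two occurrences, skipping the dash and one space on each side
    mutate(New_substring = str_sub(string = var_name,
                                   start = n_dash + 2,
                                   end = m_dash - 2))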

#                         var_name New_substring n_dash m_dash
# 1: kj<hdf - fæld - adsk-jf -h af           ads     15     21
# 2:   kj<hds - sdaf - saflaf- adf         safla     15     23
# 3:    asdgya - oaid - aa-s--s a-             a     15     19
# 4:        k<hdfk - lkja - ljad -          ljad     15     22    
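
The question counts dashes from the right rather than the left. The same str_locate_all() matrices can be indexed from their last row to do that. What follows is a minimal sketch rather than a definitive implementation: the helper name between_dashes_from_right and its argument n are hypothetical, and it assumes n is at least 2.

```r
library(stringr)

# Hypothetical helper: text between the n-th and (n-1)-th "-" counted from the right.
# Assumes n >= 2; returns NA when a string has fewer than n dashes.
between_dashes_from_right <- function(x, n = 4) {
  locs <- str_locate_all(x, fixed("-"))
  vapply(seq_along(x), function(i) {
    starts <- locs[[i]][, "start"]   # positions of every dash in x[i]
    k <- length(starts)
    if (k < n) return(NA_character_)
    # starts[k - n + 1] is the n-th dash from the right,
    # starts[k - n + 2] is the one immediately after it
    str_sub(x[i], starts[k - n + 1] + 1, starts[k - n + 2] - 1)
  }, character(1))
}

between_dashes_from_right(c("here-are-some-words-to-try",
                            "a-b-c-d-e-f-g-h-i",
                            "no dash"))
# [1] "some" "f"    NA
```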

Upvotes: 0

Jan

Reputation: 43169

You could use

([^-]+)(?:-[^-]+){3}$

See a demo on regex101.com.


In R this could be

library(dplyr)
library(stringr)
df <- data.frame(string = c('here-are-some-words-to-try', 'a-b-c-d-e-f-g-h-i', ' no dash in here'), stringsAsFactors = FALSE)

df <- df %>%
  mutate(outcome = str_match(string, '([^-]+)(?:-[^-]+){3}$')[,2])
df

And yields

                      string outcome
1 here-are-some-words-to-try    some
2          a-b-c-d-e-f-g-h-i       f
3            no dash in here    <NA>
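
Reading the pattern from left to right: ([^-]+) captures a run of non-dash characters, and (?:-[^-]+){3}$ requires exactly three more dash-prefixed fields before the end of the string, so the captured field is the fourth from the right. As a sketch of how the count could be made a parameter, the base R helper below builds the pattern with sprintf(). The name nth_field_from_right and its argument n are hypothetical and not part of the answer above.

```r
# Hypothetical helper: n-th dash-separated field counted from the right.
# Returns NA when the string has fewer than n fields.
nth_field_from_right <- function(x, n = 4) {
  # n - 1 further fields must follow the captured one
  pattern <- sprintf("([^-]+)(?:-[^-]+){%d}$", n - 1)
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # each hit is c(full match, capture group 1), or character(0) if no match
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

nth_field_from_right(c("here-are-some-words-to-try",
                       "a-b-c-d-e-f-g-h-i",
                       "no dash in here"))
# [1] "some" "f"    NA
```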

Upvotes: 9

d.b

Reputation: 32548

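x = c("here-are-some-words-to-try", "a-b-c-d-e-f-g-h-i")
sapply(x, function(strings){
    # positions of every "-" in the string
    ind = unlist(gregexpr(pattern = "-", text = strings))
    # with fewer than four dashes return NA; otherwise take the text
    # between the 4th and 3rd dash counted from the right
    if (length(ind) < 4){NA}
    else{substr(strings, ind[length(ind) - 3] + 1, ind[length(ind) - 2] - 1)}
})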
#here-are-some-words-to-try          a-b-c-d-e-f-g-h-i 
#                    "some"                        "f" 

Upvotes: 2

denrou

Reputation: 640

How about splitting your string? Something like

string <- "here-are-some-words-to-try"

# separate all words
val <- strsplit(string, "-")[[1]]

# reverse the order
val <- rev(val)

# take the 4th element
val[4]

# And using a dataframe
library(tidyverse)
tibble(string = c("here-are-some-words-to-try", "a-b-c-d-e-f-g-h-i")) %>% 
mutate(outcome = map_chr(string, function(s) rev(strsplit(s, "-")[[1]])[4]))
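
The split-and-reverse steps above can also be sketched without the tidyverse. This is only an illustration: the helper name field_from_right and its arguments n and sep are hypothetical. It passes fixed = TRUE to strsplit() so that a delimiter such as "." is treated literally rather than as a regular expression.

```r
# Hypothetical helper: n-th field from the right, splitting on a literal delimiter.
field_from_right <- function(x, n = 4, sep = "-") {
  parts <- strsplit(x, sep, fixed = TRUE)
  # rev(p)[n] is NA when a string has fewer than n fields
  vapply(parts, function(p) rev(p)[n], character(1))
}

field_from_right(c("here-are-some-words-to-try", "a-b-c"))
# [1] "some" NA
field_from_right("a.b.c.d.e", n = 2, sep = ".")
# [1] "d"
```

One caveat: strsplit() drops an empty field after a trailing delimiter, so a string ending in "-" yields one field fewer than its dash count suggests.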

Upvotes: 1
