Mr. Biggums

Reputation: 207

R/stringr: extract the string after the nth occurrence of "_", ending at the next occurrence of "_"

Using R and the stringr package (or any other package, for that matter), I want to extract the substring that starts after the nth occurrence of "_" and ends at the next occurrence of "_".

For example:

df <- c("J_J_HERE_jfdkaldjhieuwui","blahblah_ffd_THIS_fjdkalfj_jdka_")

I would want this:

df_edited <- c("HERE","THIS")

Or, for this example, I want to extract everything after the single separator character that follows "ex" (a space in the first string, an underscore in the second), ending at the next occurrence of "_":

df2 <- c("ex HERE_jfdkaldjhieuwui","ex_THIS_fjdkalfj_jdka_")

I would want this:

df_edited <- c("HERE","THIS")

Also, where can I find a good cheat sheet for stringr patterns? I find them confusing.

Upvotes: 3

Views: 2702

Answers (3)

Rob Smith

Reputation: 561

Another option is to identify the position of the delimiter(s) in the string. This solution duplicates one given for a similar question here. It is a little messy, but it achieved what I wanted and would solve your issue. I also like that I can modify it to suit a variety of situations, although if I could get my head around regex, that would undoubtedly be cleaner and more efficient.

The code below uses a combination of functions. stringr::str_locate_all() outputs a list of matrices, one per input string. The first column of each matrix holds the start position of each occurrence of the pattern, and the second column holds the end position. Each row of each matrix thus contains the start and end positions of one occurrence of the pattern.
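
To make that structure concrete, here is a minimal sketch on a made-up vector `x` (not the data from this answer). It assumes stringr is installed; `fixed("-")` is used so the dash is matched literally rather than as a regular expression.

```r
library(stringr)

x <- c("a-b-c", "no dashes", "x--y")

# One matrix per input string; columns are named "start" and "end"
loc <- str_locate_all(x, fixed("-"))

loc[[1]]
#      start end
# [1,]     2   2
# [2,]     4   4

# A string without any match yields a matrix with zero rows
nrow(loc[[2]])
# [1] 0
```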

Since I am working in a dataframe and want to use the specific index numbers, I find it easiest to extract the number related to the start of the pattern and save it as a variable in the dataframe.

purrr::map() then allows you to extract a particular value, in this case the n-th. In my example, .x[,1][2] extracts the start index (i.e. first column) of the second occurrence (i.e. second row) from each matrix. This value then needs to be unlisted and stored as a numeric value.
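
As a small illustration of that extraction step, the sketch below uses purrr::map_int(), which returns an integer vector directly and so avoids the separate unlist step. This is a variation on the approach described here, not the code from this answer; the vector `x` and the value of `n` are made up for the example.

```r
library(stringr)
library(purrr)

x <- c("a-b-c-d", "x-y", "none")
n <- 2  # which occurrence to look up (illustrative value)

# Start position of the n-th dash in each string;
# NA when a string contains fewer than n dashes
nth_start <- map_int(str_locate_all(x, fixed("-")), ~ .x[, "start"][n])
nth_start
# [1]  4 NA NA
```

The NA values are worth checking for before calling str_sub(), since a missing position gives a missing substring.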

stringr::str_length() then returns the integer length of the string (or, 'total number of characters').

After extracting the specific index values, you then need to extract a substring from position to position. Just remember that special characters need to be properly escaped.

Finally, stringr::str_sub() extracts everything between the n-th occurrence of the particular pattern and the last character in the string.

library(data.table)
library(dplyr)
library(stringr)
library(purrr)

text_pattern <- "-"
df <- data.table(var_name = c("kj<hdf - fæld - adsk-jf -h af",
                              "kj<hds - sdaf - saflaf- adf",
                              "asdgya - oaid - aa-s--s a-",
                              "k<hdfk - lkja - ljad -"))

df <- df %>%
    # start index (column 1) of the second match (row 2) in each matrix
    mutate(second_dash = as.numeric(unlist(
        str_locate_all(var_name, text_pattern) %>%
            map(~ .x[, 1][2])
    ))) %>%
    # skip the dash and the space after it, then keep everything to the end
    mutate(New_substring = str_sub(string = var_name,
                                   start = second_dash + 2,
                                   end = str_length(var_name)))

#                         var_name second_dash New_substring
# 1: kj<hdf - fæld - adsk-jf -h af          15 adsk-jf -h af
# 2:   kj<hds - sdaf - saflaf- adf          15   saflaf- adf
# 3:    asdgya - oaid - aa-s--s a-          15    aa-s--s a-
# 4:        k<hdfk - lkja - ljad -          15        ljad -

For your particular case, continuing the use of a dash rather than the underscore, you could specify the index numbers (or occurrence numbers) with variables, n and m for example.

In the worked example below, I have added 2 to the start and subtracted 2 from the end of the sub-string to remove the spaces. Note that the index points at the matched character itself, so if you do not want the dash "-" or underscore "_" included in your output, you will need to add at least 1 to the start index and subtract at least 1 from the end index. It all depends on your specific purpose. This could also be achieved more intelligently by removing the 'padding' of spaces around the values, but I'm just including the modifications to illustrate how the index values can be manipulated.

library(data.table)
library(dplyr)
library(stringr)
library(purrr)

text_pattern <- "-"
n <- 2      # occurrence that opens the substring
m <- n + 1  # occurrence that closes it

df <- data.table(var_name = c("kj<hdf - fæld - adsk-jf -h af",
                              "kj<hds - sdaf - saflaf- adf",
                              "asdgya - oaid - aa-s--s a-",
                              "k<hdfk - lkja - ljad -"))

df <- df %>%
    # start index of the n-th dash
    mutate(n_dash = as.numeric(unlist(
        str_locate_all(var_name, text_pattern) %>%
            map(~ .x[, 1][n])
    ))) %>%
    # start index of the m-th dash
    mutate(m_dash = as.numeric(unlist(
        str_locate_all(var_name, text_pattern) %>%
            map(~ .x[, 1][m])
    ))) %>%
    # drop each dash and the space next to it
    mutate(New_substring = str_sub(string = var_name,
                                   start = n_dash + 2,
                                   end = m_dash - 2))

#                         var_name n_dash m_dash New_substring
# 1: kj<hdf - fæld - adsk-jf -h af     15     21           ads
# 2:   kj<hds - sdaf - saflaf- adf     15     23         safla
# 3:    asdgya - oaid - aa-s--s a-     15     19             a
# 4:        k<hdfk - lkja - ljad -     15     22          ljad
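
For the underscore-delimited strings in the question, the same position-based idea might look like the sketch below. The helper name `between_nth` is hypothetical, and the code assumes only stringr is installed; it works on a plain character vector rather than a data.table.

```r
library(stringr)

# Hypothetical helper: text between the n-th and (n+1)-th occurrence of `delim`
between_nth <- function(x, n, delim = "_") {
  locs <- str_locate_all(x, fixed(delim))
  # last character of the n-th delimiter
  open_pos <- vapply(locs, function(m) m[, "end"][n], integer(1))
  # first character of the (n+1)-th delimiter
  close_pos <- vapply(locs, function(m) m[, "start"][n + 1], integer(1))
  # step one character inside each delimiter so neither ends up in the result
  str_sub(x, open_pos + 1, close_pos - 1)
}

df <- c("J_J_HERE_jfdkaldjhieuwui", "blahblah_ffd_THIS_fjdkalfj_jdka_")
between_nth(df, n = 2)
# [1] "HERE" "THIS"
```

Strings with fewer than n + 1 delimiters come back as NA, because the missing position propagates through str_sub().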

Upvotes: 0

Ronak Shah

Reputation: 389155

You can split the data on the delimiter so that all the words are readily available.

df <- c("J_J_HERE_jfdkaldjhieuwui","blahblah_ffd_THIS_fjdkalfj_jdka_")
list_word <- strsplit(df, '_')
list_word

#[[1]]
#[1] "J"               "J"               "HERE"            "jfdkaldjhieuwui"

#[[2]]
#[1] "blahblah" "ffd"      "THIS"     "fjdkalfj" "jdka"    

Then you can get any value at position n from the list.

sapply(list_word, `[`, 3)
#[1] "HERE" "THIS"

sapply(list_word, `[`, 2)
#[1] "J"   "ffd"
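
The split-and-index steps can be wrapped in a small function, sketched here in base R. The name `nth_piece` is hypothetical. Because strsplit() treats its split argument as a regular expression, the same helper may also cover the second example in the question, where the separator is either a space or an underscore.

```r
# Hypothetical helper: the k-th piece of each string after splitting on `split`
nth_piece <- function(x, k, split = "_") {
  vapply(strsplit(x, split), function(p) p[k], character(1))
}

df <- c("J_J_HERE_jfdkaldjhieuwui", "blahblah_ffd_THIS_fjdkalfj_jdka_")
nth_piece(df, 3)
# [1] "HERE" "THIS"

# Second example from the question: split on a space or an underscore
df2 <- c("ex HERE_jfdkaldjhieuwui", "ex_THIS_fjdkalfj_jdka_")
nth_piece(df2, 2, split = "[ _]")
# [1] "HERE" "THIS"
```

Using vapply() instead of sapply() guarantees a character vector, with NA for strings that have fewer than k pieces.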

Upvotes: 2

akrun

Reputation: 887541

We could create a pattern based on 'n':

df <- c("J_J_HERE_jfdkaldjhieuwui","blahblah_ffd_THIS_fjdkalfj_jdka_")
n <- 2
pat <- sprintf('([^_]+_){%d}([^_]+)_.*', n)
sub(pat, '\\2', df)
#[1] "HERE" "THIS"

Details -

([^_]+_){2} matches one or more characters that are not a _ followed by a _, repeated 'n' times (here 2). ([^_]+) then captures the next run of characters that are not a _, and _.* matches the following _ and all remaining characters. In the replacement, \\2 is the backreference to the second capture group, so the whole match is replaced by that group.
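
One thing to watch for is that sub() returns a string unchanged when the pattern does not match. The sketch below wraps the pattern in a hypothetical function, `extract_after_nth`, that returns NA in that case. The leading ^ anchor is an addition made for this example and is not part of the pattern above.

```r
df <- c("J_J_HERE_jfdkaldjhieuwui", "blahblah_ffd_THIS_fjdkalfj_jdka_")

# Hypothetical wrapper around the sprintf()-built pattern
extract_after_nth <- function(x, n) {
  # ^ anchors the match at the start of the string (added for this sketch)
  pat <- sprintf("^([^_]+_){%d}([^_]+)_.*", n)
  # sub() leaves non-matching strings untouched, so mask them with NA
  ifelse(grepl(pat, x), sub(pat, "\\2", x), NA_character_)
}

extract_after_nth(df, 2)
# [1] "HERE" "THIS"

extract_after_nth(df, 1)
# [1] "J"   "ffd"

extract_after_nth("no underscores", 2)
# [1] NA
```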

Upvotes: 2
