Reputation: 728
I've seen many iterations of extracting with gsub, but they mostly deal with extracting from left to right or after one occurrence. I want to match from right to left, counting four occurrences of "-" and matching everything between the 3rd and 4th occurrence.
For example:
string                      outcome
here-are-some-words-to-try  some
a-b-c-d-e-f-g-h-i           f
Here are a few references I've tried using:
Upvotes: 12
Views: 19820
Reputation: 561
Another option could be to identify the position of the element(s) in the string. This solution is duplicated for a similar question here.
This is a little messy, but it achieved what I wanted and would solve your issue. I also like that I can modify it to suit a variety of situations, although if I could get my head around regex it would undoubtedly be cleaner and more efficient, as in Jan's solution.
The code below uses a combination of the following functions.
stringr::str_locate_all() returns a list of matrices, one per input string. In each matrix, the first column holds the start position of each occurrence of the pattern and the second column holds the end position, so each row contains the start and end positions of one match.
Since I am working in a data frame and want to use the specific index numbers, I find it easiest to extract the number for the start of the pattern and save it as a variable in the data frame.
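To make the shape of that return value concrete, here is a minimal sketch. The input vector `x` is a hypothetical example, not data from this answer.

```r
library(stringr)

# Hypothetical example strings, not taken from the post
x <- c("a-b-c", "no dash")

# One matrix per input string; the columns are named "start" and "end"
locs <- str_locate_all(x, "-")

first_starts <- locs[[1]][, "start"]  # start position of each "-" in "a-b-c"
no_match_rows <- nrow(locs[[2]])      # a string without a match gives a zero-row matrix

first_starts
no_match_rows
```

Because a string with no match yields an empty matrix rather than an error, indexing a missing row later simply gives NA.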
purrr::map() then lets you extract a particular value, in this case the n-th occurrence. In my example I extract the start index (first column) of the second occurrence (second row) from each matrix with .x[,1][2]. This value then needs to be unlisted and stored as a numeric value.
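As a sketch of that extraction step in isolation: the version below swaps map() plus unlist() plus as.numeric() for purrr::map_dbl(), which returns a numeric vector directly. That swap is my substitution, not what the answer's code does, and the inputs `x` and `n` are hypothetical.

```r
library(stringr)
library(purrr)

x <- c("a-b-c", "a-b", "abc")  # hypothetical inputs
n <- 2                         # which occurrence to look up

# Start position of the n-th "-" in each string.
# Indexing past the last match gives NA, so short strings do not error.
nth_start <- map_dbl(str_locate_all(x, "-"), ~ .x[, "start"][n])
nth_start
```

Only the first string has a second dash, so the other two entries come back as NA.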
stringr::str_length() then returns the integer length of the string (the total number of characters).
After extracting the specific index values, you then need to extract a substring from position to position. Just remember that special characters in the pattern need to be properly escaped.
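The escaping caveat matters as soon as the separator is a regex metacharacter. A small sketch, assuming a hypothetical string that uses "." as its separator:

```r
library(stringr)

x <- "a.b.c"  # hypothetical input with "." as the separator

# Unescaped "." is a regex wildcard and matches every character
unescaped <- str_locate_all(x, ".")[[1]][, "start"]

# Escaping with a double backslash matches only the literal dots
escaped <- str_locate_all(x, "\\.")[[1]][, "start"]

# fixed() treats the whole pattern as plain text, so no escaping is needed
literal <- str_locate_all(x, fixed("."))[[1]][, "start"]

unescaped
escaped
literal
```

A dash outside a character class has no special meaning, which is why the "-" pattern in this answer works unescaped.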
Finally, stringr::str_sub() is used to extract everything between the n-th occurrence of the particular pattern and the last character in the string.
library(data.table)
library(dplyr)
library(stringr)
library(purrr)

text_pattern <- "-"

df <- data.table(var_name = c("kj<hdf - fæld - adsk-jf -h af",
                              "kj<hds - sdaf - saflaf- adf",
                              "asdgya - oaid - aa-s--s a-",
                              "k<hdfk - lkja - ljad -"))

df <- df %>%
  # start position of the second "-" in each string
  mutate(second_dash = as.numeric(unlist(
    str_locate_all(pattern = text_pattern, var_name) %>%
      map(~ .x[, 1][2])
  ))) %>%
  # everything from two characters past that dash to the end of the string
  mutate(New_substring = str_sub(string = var_name,
                                 start = second_dash + 2,
                                 end = str_length(var_name)))

#    var_name                       second_dash  New_substring
# 1: kj<hdf - fæld - adsk-jf -h af           15  adsk-jf -h af
# 2: kj<hds - sdaf - saflaf- adf             15  saflaf- adf
# 3: asdgya - oaid - aa-s--s a-              15  aa-s--s a-
# 4: k<hdfk - lkja - ljad -                  15  ljad -
For your particular case, continuing with a dash rather than an underscore, you could specify the index numbers (or occurrence numbers) with variables, n and m for example.
In the worked example below, I have added 2 to the start and deducted 2 from the end of the substring to remove the spaces. Note that the index includes the character in question, so if you do not want the dash "-" or underscore "_" included in your output, you will need to add or deduct at least 1 from the index you extract. It all depends on your specific purpose. This could also be achieved more intelligently by removing the 'padding' of spaces around the values, but I'm just including the modifications to illustrate how the index values can be manipulated.
# uses the same packages as the previous example
text_pattern <- "-"
n <- 2
m <- n + 1

df <- data.table(var_name = c("kj<hdf - fæld - adsk-jf -h af",
                              "kj<hds - sdaf - saflaf- adf",
                              "asdgya - oaid - aa-s--s a-",
                              "k<hdfk - lkja - ljad -"))

df <- df %>%
  # start position of the n-th "-"
  mutate(n_dash = as.numeric(unlist(
    str_locate_all(pattern = text_pattern, var_name) %>%
      map(~ .x[, 1][n])
  ))) %>%
  # start position of the m-th "-"
  mutate(m_dash = as.numeric(unlist(
    str_locate_all(pattern = text_pattern, var_name) %>%
      map(~ .x[, 1][m])
  ))) %>%
  # text between the two dashes, offset by 2 on each side
  mutate(New_substring = str_sub(string = var_name,
                                 start = n_dash + 2,
                                 end = m_dash - 2))

#    var_name                       n_dash  m_dash  New_substring
# 1: kj<hdf - fæld - adsk-jf -h af      15      21  ads
# 2: kj<hds - sdaf - saflaf- adf        15      23  safla
# 3: asdgya - oaid - aa-s--s a-         15      19  a
# 4: k<hdfk - lkja - ljad -             15      22  ljad
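The alternative mentioned earlier in this answer, trimming the padding instead of hard-coding offsets of 2, could look roughly like the sketch below. It uses stringr::str_trim(), which this answer does not use, and works on a single example string for brevity.

```r
library(stringr)

x <- "kj<hds - sdaf - saflaf- adf"  # second example string from this answer
n <- 2
m <- n + 1

# Start positions of every "-" in the string
starts <- str_locate_all(x, "-")[[1]][, "start"]

# Skip only the dashes themselves (+1 / -1), then strip surrounding spaces
piece <- str_trim(str_sub(x, starts[n] + 1, starts[m] - 1))
piece
```

With this variant the field comes back as "saflaf", because the fixed offset of 2 also cuts a real character when there is no space before the next dash.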
Upvotes: 0
Reputation: 43169
You could use
([^-]+)(?:-[^-]+){3}$
In R, this could be
library(dplyr)
library(stringr)
df <- data.frame(string = c('here-are-some-words-to-try', 'a-b-c-d-e-f-g-h-i', ' no dash in here'), stringsAsFactors = FALSE)
df <- df %>%
  mutate(outcome = str_match(string, '([^-]+)(?:-[^-]+){3}$')[,2])
df
And yields
                      string outcome
1 here-are-some-words-to-try    some
2          a-b-c-d-e-f-g-h-i       f
3            no dash in here    <NA>
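Reading the pattern piece by piece: ([^-]+) captures a run of non-dash characters, (?:-[^-]+){3} requires exactly three more dash-separated fields after it, and $ anchors the match to the end of the string. The capture is therefore the fourth field from the right. A sketch that parameterises the count follows. The helper name `field_from_right` and its arguments are hypothetical, and it assumes the separator is a single character with no special meaning inside a regex.

```r
library(stringr)

# Hypothetical helper: k-th field counted from the right.
# With k = 4 and sep = "-" it builds the same pattern as above.
field_from_right <- function(x, k, sep = "-") {
  pattern <- sprintf("([^%s]+)(?:%s[^%s]+){%d}$", sep, sep, sep, k - 1)
  str_match(x, pattern)[, 2]
}

res <- field_from_right(
  c("here-are-some-words-to-try", "a-b-c-d-e-f-g-h-i", "no dash"),
  k = 4
)
res
```

Strings with fewer than k fields do not match the pattern and come back as NA.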
Upvotes: 9
Reputation: 32548
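x <- c("here-are-some-words-to-try", "a-b-c-d-e-f-g-h-i")
sapply(x, function(strings) {
  # positions of every "-" in the string
  ind <- unlist(gregexpr(pattern = "-", text = strings))
  if (length(ind) < 4) {
    NA  # fewer than four dashes: nothing to extract
  } else {
    # text between the 4th-from-last and 3rd-from-last dash
    substr(strings, ind[length(ind) - 3] + 1, ind[length(ind) - 2] - 1)
  }
})
# here-are-some-words-to-try          a-b-c-d-e-f-g-h-i
#                     "some"                        "f"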
Upvotes: 2
Reputation: 640
How about splitting your sentence? Something like:
string <- "here-are-some-words-to-try"
# separate all words
val <- strsplit(string, "-")[[1]]
# reverse the order
val <- rev(val)
# take the 4th element
val[4]
# And using a dataframe
library(tidyverse)
tibble(string = c("here-are-some-words-to-try", "a-b-c-d-e-f-g-h-i")) %>%
  mutate(outcome = map_chr(string, function(s) rev(strsplit(s, "-")[[1]])[4]))
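The same split-and-reverse idea can be wrapped in a small base R function so that the position is a parameter. This is only a sketch: the name `nth_from_right` and its arguments are hypothetical, and it returns NA for strings with too few fields.

```r
# Hypothetical helper: k-th separator-delimited field counted from the right
nth_from_right <- function(x, k, sep = "-") {
  vapply(strsplit(x, sep, fixed = TRUE), function(parts) {
    if (length(parts) < k) NA_character_ else rev(parts)[k]
  }, character(1))
}

out <- nth_from_right(
  c("here-are-some-words-to-try", "a-b-c-d-e-f-g-h-i", "a-b"),
  k = 4
)
out
```

One caveat: strsplit() drops a trailing empty field, so a string that ends with the separator is counted as if the final separator were not there.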
Upvotes: 1