TMorris

Reputation: 29

Use stringr to extract the whole word in a string with a particular set of characters in it

I have a series of strings that have a particular set of characters. What I'd like to do is be able to extract just the word from the string with those characters in it, and discard the rest.

I've tried various regular expressions, but I either end up splitting out every word or getting the entire string back. Below is an example of the kinds of strings involved. I've been trying to use stringr::str_extract_all(), since some strings contain more than one word that needs to be pulled out.

data <- c("AlvariA?o, 1961","Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994", "A?vila & Cordeiro, 2015", "BabiA?, 1922")

result <- unlist(stringr::str_extract_all(data, "regex"))

From this I'd like a result that pulls out all the words that contain "A?", like this:

result <- c("AlvariA?o", "MagaA?a", "A?vila", "BabiA?")

It seems really simple but my regex knowledge is just not cutting it at the moment.

Upvotes: 1

Views: 1717

Answers (2)

GKi

Reputation: 39657

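? is a regex quantifier, so to match a literal ? it has to be escaped. In an R string that is written \\?, so A\\? matches A?. \\w matches any word character (for ASCII text, equivalent to [a-zA-Z0-9_]; stringr's regex engine also treats non-ASCII letters and digits as word characters). * matches the previous token zero or more times, taking as many as possible and giving back as needed (greedy). Because \\w does not match spaces, commas or hyphens, the match stops at the boundaries of the word around A?.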

unlist(stringr::str_extract_all(data, "\\w*A\\?\\w*"))
#[1] "AlvariA?o" "MagaA?a"   "A?vila"    "BabiA?"   
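
As a small sketch of two variations on the pattern above (not part of the original answer): a character class `[?]` is another way to write a literal question mark, and adding an escaped hyphen to the class on both sides keeps hyphenated names such as Garcia-MagaA?a whole instead of cutting them at the hyphen. The variable names `words_only` and `with_hyphens` are made up for illustration, and `data` is the vector from the question.

```r
library(stringr)

data <- c("AlvariA?o, 1961",
          "Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994",
          "A?vila & Cordeiro, 2015",
          "BabiA?, 1922")

# "[?]" is a literal question mark, an alternative to the "\\?" escape
words_only <- unlist(str_extract_all(data, "\\w*A[?]\\w*"))
words_only

# Allow hyphens as well as word characters, so compound surnames stay whole
with_hyphens <- unlist(str_extract_all(data, "[\\w\\-]*A[?][\\w\\-]*"))
with_hyphens
```

Which variant is appropriate depends on whether a hyphenated surname should count as one word; the question's expected output treats "MagaA?a" as the word, which is what the first pattern returns.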

Upvotes: 2

Kra.P

Reputation: 15123

I wrote this as a function, but it's considerably worse than GKi's...

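    library(quanteda)
    library(stringr)  # str_split(), str_replace() and the %>% pipe

    set_of_character <- function(dummy, key){
      # length of the key, used as the n-gram size
      n <- nchar(key)
      # split every string on spaces and drop the first comma in each word
      dummy %>% str_split(., " ") %>%
        unlist %>% 
        str_replace(., ",", "") %>%
        # for each word, build all character n-grams of length n
        sapply(., function(x) {
          x %>%
            tokens("character") %>%
            unlist() %>%
            char_ngrams(n, concatenator = "")
        }) %>%
        # keep the names of the words whose n-grams contain the key
        sapply(., function(x) (key %in% x)) %>% which(TRUE) %>% names %>%
        return    
    }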

For your example, this splits on spaces only, so the hyphenated name is returned whole:

    set_of_character(data, "A?")
    [1] "AlvariA?o"      "Garcia-MagaA?a" "A?vila"         "BabiA?"
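
The same split-and-filter idea can be sketched with stringr alone, without building n-grams. This is a simplified illustration rather than the function above: `words_containing` is a hypothetical helper name, and it assumes that words are separated by whitespace and that commas are the only punctuation to strip. `fixed()` makes stringr treat the key as a literal string, so the `?` needs no escaping.

```r
library(stringr)

# Hypothetical helper: return every whitespace-separated word containing `key`
words_containing <- function(strings, key) {
  # split all strings into words on runs of whitespace
  words <- unlist(str_split(strings, "\\s+"))
  # strip commas attached to the words
  words <- str_remove_all(words, ",")
  # keep the words that contain the key as a literal substring
  words[str_detect(words, fixed(key))]
}

data <- c("AlvariA?o, 1961",
          "Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994",
          "A?vila & Cordeiro, 2015",
          "BabiA?, 1922")

found <- words_containing(data, "A?")
found
```

As with the function above, hyphenated names are kept whole because the split happens only on whitespace.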

Upvotes: 1
