aStarIsCorn

Reputation: 89

Stringr pattern to detect capitalized words

I am trying to write a function to detect words that are written entirely in capital letters.

Currently, my code is:

library(tidyverse)  # provides %>%, add_row(), mutate() and str_extract_all()

df <- data.frame(title = character(), id = numeric()) %>%
        add_row(title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)

df <- df %>%
        mutate(sec_code_1 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][1])
               , sec_code_2 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][2])
               , sec_code_3 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][3]))
df

Where output is:

title id sec_code_1 sec_code_2 sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for 6 DONT WAS

The first 3-5 letter capitalized word should be "THIS", the second should skip "EXAMPLE" (more than 5 letters) and be "DONT", and the third should be "WAS", i.e.:

title id sec_code_1 sec_code_2 sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for 6 THIS DONT WAS

Does anyone know where I'm going wrong? Specifically, how can I express "space or beginning of string" or "space or end of string" in a stringr pattern?

Upvotes: 2

Views: 311

Answers (1)

Ronak Shah

Reputation: 388817

If you run the code with your regex you'll realise 'THIS' is not included in the output at all.

str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS " 

This is because the pattern only matches words that have both leading and trailing whitespace. 'THIS' has no leading whitespace because it is at the start of the string, hence it does not satisfy the regex pattern. You can use word boundaries (\\b) instead.

str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
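
If you literally want "whitespace or start of string" and "whitespace or end of string" rather than a word boundary, negative lookarounds are one way to write it. This is a minimal sketch, assuming stringr's default ICU regex engine; the variable names `pattern_ws` and `codes` are only illustrative. Unlike \\b, this version does not match a word that is directly followed by punctuation such as "WAS,".

```r
library(stringr)

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<!\\S): not preceded by a non-whitespace character
#           (i.e. start of string or whitespace)
# (?!\\S):  not followed by a non-whitespace character
#           (i.e. end of string or whitespace)
pattern_ws <- "(?<!\\S)[A-Z]{3,5}(?!\\S)"

codes <- str_extract_all(title, pattern_ws)[[1]]
codes
#[1] "THIS" "DONT" "WAS"
```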

Your code would work if you use the \\b pattern above in it.
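
Note that `[[1]]` in the original `mutate()` only ever looks at the matches from the first row, so it is fine for a one-row data frame but not beyond that. A possible row-wise sketch is below; the second row and the helper name `nth_code` are made up for illustration and are not part of the question.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  title = c("THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
            "no codes in this hypothetical second row"),
  id = c(6, 7),
  stringsAsFactors = FALSE
)

pattern <- "\\b[A-Z]{3,5}\\b"

# Hypothetical helper: n-th match for every row,
# NA when a row has fewer than n matches
nth_code <- function(x, n) sapply(str_extract_all(x, pattern), `[`, n)

df <- df %>%
  mutate(sec_code_1 = nth_code(title, 1),
         sec_code_2 = nth_code(title, 2),
         sec_code_3 = nth_code(title, 3))
df
```

Rows without enough matches get NA in the corresponding columns.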

Or you could also use:

library(tidyverse)

df %>%
  mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
  unnest_wider(code) %>%
  rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))

# title                                     id sec_code_1 sec_code_2 sec_code_3
#  <chr>                                  <dbl> <chr>      <chr>      <chr>     
#1 THIS is an EXAMPLE where I DONT get t…     6 THIS       DONT       WAS 
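
Depending on your tidyr version, `unnest_wider()` may refuse to unnest unnamed elements unless `names_sep` is supplied, in which case the `rename_with()` step above is no longer needed. A sketch under that assumption, naming the list column `sec_code` so that the generated columns come out as sec_code_1, sec_code_2, and so on (the object name `res` is only illustrative):

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(
  title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  id = 6,
  stringsAsFactors = FALSE
)

res <- df %>%
  # One character vector of matches per row, stored as a list column
  mutate(sec_code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  # names_sep pastes the column name and the element index together
  unnest_wider(sec_code, names_sep = "_")

names(res)
```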

Upvotes: 2
