Stringr pattern to detect capitalized words

Question

I am trying to write a function to detect words that are written entirely in capitals.

My current code:

library(tidyverse)

df <- data.frame(title = character(), id = numeric()) %>%
        add_row(title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)

df <- df %>%
        mutate(sec_code_1 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][1]) 
               , sec_code_2 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][2]) 
               , sec_code_3 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][3]))
df

Where output is:

title	id	sec_code_1	sec_code_2	sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for	6	DONT	WAS

The first capitalized word of 3 to 5 letters is "THIS", the second should skip "EXAMPLE" (more than 5 letters) and be "DONT", and the third should be "WAS", i.e.:

title	id	sec_code_1	sec_code_2	sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for	6	THIS	DONT	WAS

Does anyone know where I'm going wrong? Specifically, how can I denote "space or beginning of string" or "space or end of string" using stringr?

Ronak Shah · Accepted Answer

If you run the code with your regex you'll realise 'THIS' is not included in the output at all.

str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "

This is because you are extracting words with both leading and trailing whitespace. 'THIS' has no leading whitespace because it is at the start of the sentence, so it does not satisfy the regex pattern. You can use word boundaries (\b, written as "\\b" inside an R string) instead.

str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"

Your code would work if you use the above pattern in it.
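
As a rough sketch of what that might look like, here is the question's `mutate()` call with the word-boundary pattern swapped in. The `pattern` and `result` variable names are introduced here for illustration only. Note that `[[1]]` reads the matches for the first row only, so this form assumes a single-row data frame like the one in the question.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  id = 6
)

# Word boundaries instead of literal spaces; the backslash is doubled
# because it sits inside an R string literal.
pattern <- "\\b[A-Z]{3,5}\\b"

result <- df %>%
  mutate(
    # [[1]] picks the match vector for the first (and only) row
    sec_code_1 = str_extract_all(title, pattern)[[1]][1],
    sec_code_2 = str_extract_all(title, pattern)[[1]][2],
    sec_code_3 = str_extract_all(title, pattern)[[1]][3]
  )

result
```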

Alternatively, you could use:

library(tidyverse)

df %>%
  mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
  unnest_wider(code) %>%
  rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))

#  title                                     id sec_code_1 sec_code_2 sec_code_3
#  <chr>                                  <dbl> <chr>      <chr>      <chr>
#1 THIS is an EXAMPLE where I DONT get t…     6 THIS       DONT       WAS
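
If you want to express "space or beginning of string" and "space or end of string" literally, one possible approach (an alternative to the word-boundary pattern above, not a replacement for it) is to use negative lookarounds: "not preceded by a non-whitespace character" and "not followed by a non-whitespace character". The sketch below assumes that reading of the requirement; the `pattern` and `codes` names are made up for the example.

```r
library(stringr)

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<!\\S) : the match is not preceded by a non-whitespace character,
#            i.e. it follows whitespace or starts the string
# (?!\\S)  : the match is not followed by a non-whitespace character,
#            i.e. it precedes whitespace or ends the string
pattern <- "(?<!\\S)[A-Z]{3,5}(?!\\S)"

codes <- str_extract_all(title, pattern)[[1]]
codes
```

The practical difference from `\\b` is that a word boundary also occurs next to punctuation, so `\\b[A-Z]{3,5}\\b` would match `ABC` in `(ABC)`, whereas the lookaround version only accepts whitespace or the string edges as delimiters.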
