Reputation: 89
I am trying to write a function that detects words written entirely in capital letters.
My current code:
library(tidyverse)

df <- data.frame(title = character(), id = numeric()) %>%
  add_row(title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)

df <- df %>%
  mutate(sec_code_1 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][1]),
         sec_code_2 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][2]),
         sec_code_3 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][3]))
df
The output is:
title | id | sec_code_1 | sec_code_2 | sec_code_3 |
---|---|---|---|---|
THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | DONT | WAS | NA |
The first 3-5 letter capitalised word is "THIS", the second should skip "EXAMPLE" (more than 5 letters) and be "DONT", and the third should be "WAS", i.e.:
title | id | sec_code_1 | sec_code_2 | sec_code_3 |
---|---|---|---|---|
THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | THIS | DONT | WAS |
Does anyone know where I'm going wrong? Specifically, how can I express "space or beginning of string" or "space or end of string" in a stringr pattern?
Upvotes: 2
Views: 311
Reputation: 388817
If you run the code with your regex, you'll see that 'THIS' is not included in the output at all.
str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "
This is because the pattern requires a space on both sides of each word. 'THIS' has no leading space because it is at the start of the string, so it does not satisfy the regex pattern. You can use word boundaries (\\b) instead.
str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
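If you specifically want "space or beginning of string" and "space or end of string" rather than a general word boundary, one possible sketch is to use negative lookarounds: "not preceded by a non-space" and "not followed by a non-space". The variable names `title` and `codes` below are only illustrative, and treating a literal space as the only separator is an assumption.

```r
library(stringr)

# Example title taken from the question
title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<![^ ])  : the match is not preceded by a non-space character,
#              i.e. it follows a space or starts the string
# (?![^ ])   : the match is not followed by a non-space character,
#              i.e. it precedes a space or ends the string
codes <- str_extract_all(title, "(?<![^ ])[A-Z]{3,5}(?![^ ])")[[1]]
codes
#[1] "THIS" "DONT" "WAS"
```

Unlike \\b, this variant would not pick up a capitalised word that sits directly next to punctuation (for example "ABC,"), so which one fits depends on how your titles are written.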
Your code would work if you use the above pattern in it.
Alternatively, you could use:
library(tidyverse)

df %>%
  mutate(code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  unnest_wider(code) %>%
  rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))
# title id sec_code_1 sec_code_2 sec_code_3
# <chr> <dbl> <chr> <chr> <chr>
#1 THIS is an EXAMPLE where I DONT get t… 6 THIS DONT WAS
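Depending on the installed tidyr version, `unnest_wider()` may ask for `names_sep` when the list elements are unnamed. A minimal sketch of that variant, assuming the list column is called `sec_code` so that the new columns come out as `sec_code_1`, `sec_code_2`, and so on without a separate renaming step (the name `res` is only illustrative):

```r
library(dplyr)
library(stringr)
library(tidyr)

# Same example row as in the question
df <- tibble(
  title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  id = 6
)

res <- df %>%
  # list column holding every 3-5 letter all-caps word per title
  mutate(sec_code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  # spread the list into columns named sec_code_1, sec_code_2, ...
  unnest_wider(sec_code, names_sep = "_")

res
```

Titles with fewer matches than the widest row get NA in the remaining columns.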
Upvotes: 2