Roman

Reputation: 17648

gsub or grep regex to find strings but ignoring HTML tags <>

I'm no regex expert and have been stuck at this point:

This is what I have:

a <- paste(c(LETTERS[1:20], "<br/>", LETTERS[21:26]), collapse ="")
[1] "ABCDEFGHIJKLMNOPQRST<br/>UVWXYZ"

I want to find one or more upper-case letters and wrap them in HTML tags such as bold (<b>). This works fine for the letter B alone:

 gsub("B", "<b>B</b>", a)
 [1] "A<b>B</b>CDEFGHIJKLMNOPQRST<br/>UVWXYZ"

Or "AB":

b <- c("AB")
gsub(b, paste0("<b>", b, "</b>"), a)
[1] "<b>AB</b>CDEFGHIJKLMNOPQRST<br/>UVWXYZ"

But highlighting a pattern that spans the <br/> will of course not work with this approach (e.g. gsub("STU", "<b>STU</b>", a)), because the literal string "STU" never occurs in a. So I need a function that ignores the <br/> tag. I started with something like ^(?!.*br), but I can't get it to work correctly. My expected output would be something like:

b <- c("STU")
# function and expected output:
"ABCDEFGHIJKLMNOPQR<b>ST<br/>U</b>VWXYZ"

Upvotes: 0

Views: 188

Answers (2)

tokiloutok

Reputation: 467

When you write:

"<b>A<br/>B</b><br/>C"

shouldn't it be this instead?

 "<b>A</b><br/><b>B</b><br/><b>C</b>"

If so, you can try:

library(magrittr)  # provides the %>% pipe operator


a <- paste(LETTERS[1:3],collapse = "<br/>") 
res <- strsplit(a, "<br/>") %>% unlist %>% ifelse(. %in% LETTERS, sprintf("<b>%s</b>", .), .) %>% paste0(., collapse = "<br/>")
stopifnot(res == "<b>A</b><br/><b>B</b><br/><b>C</b>")

a <- paste(LETTERS[1:5],collapse = "<br/>") 
res <- strsplit(a, "<br/>") %>% unlist %>% ifelse(. %in% LETTERS, sprintf("<b>%s</b>", .), .) %>% paste0(., collapse = "<br/>")
stopifnot(res == "<b>A</b><br/><b>B</b><br/><b>C</b><br/><b>D</b><br/><b>E</b>")

a <- "A<br/>B<br/>f<br/>C<br/>d"
res <- strsplit(a, "<br/>") %>% unlist %>% ifelse(. %in% LETTERS, sprintf("<b>%s</b>", .), .) %>% paste0(., collapse = "<br/>")
stopifnot(res == "<b>A</b><br/><b>B</b><br/>f<br/><b>C</b><br/>d")

Upvotes: 0

Chris S.

Reputation: 2225

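You can capture "A, tag, B" as one group in parentheses and refer back to it with \\1 in the gsub replacement. Here [^>]+ matches everything inside the tag, and a is assumed to be "A<br/>B<br/>C", as the output implies: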

gsub("(A<[^>]+>B)", "<b>\\1</b>", a)
[1] "<b>A<br/>B</b><br/>C"
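
The same idea can be extended from the fixed pair "A tag B" to an arbitrary search string by allowing optional tags between every pair of characters. The following is only a minimal sketch, not part of the original answer: the function name highlight_across_tags and its arguments are hypothetical, and it assumes the search string contains only letters and digits, so no regex escaping is needed.

```r
# Hypothetical helper (name and arguments are illustrative, not from the answer).
# Wraps every occurrence of `needle` in open/close tags, even when HTML tags
# sit between its characters.
highlight_across_tags <- function(x, needle, open = "<b>", close = "</b>") {
  # Assumption: needle is purely alphanumeric, so each character is a safe regex literal.
  stopifnot(grepl("^[[:alnum:]]+$", needle))
  chars <- strsplit(needle, "")[[1]]
  # Allow zero or more tags (<...>) between consecutive characters.
  pattern <- paste0("(", paste(chars, collapse = "(?:<[^>]+>)*"), ")")
  # perl = TRUE is used for the non-capturing group (?:...).
  gsub(pattern, paste0(open, "\\1", close), x, perl = TRUE)
}

x <- paste(c(LETTERS[1:20], "<br/>", LETTERS[21:26]), collapse = "")
highlight_across_tags(x, "STU")
# [1] "ABCDEFGHIJKLMNOPQR<b>ST<br/>U</b>VWXYZ"
```

Note that this sketch does not distinguish text from markup: a search string can also match letters inside a tag name, and a match that spans a closing and an opening tag would give badly nested HTML. For anything beyond simple inline tags like <br/>, an HTML parser is probably the safer route.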

Upvotes: 2
