regex: match all subscripts in an html file except a specific one

Question

I want to remove all subscripts from a piece of html code, except the subscript “rep”.

For instance, the string "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2" should become: "t(10) = 23, p<sub>rep</sub>=.2"

I was trying things like:

txt <- "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2"
gsub(pattern="<sub>(?!rep).*</sub>",replacement="",txt,perl=TRUE)

But the problem is that this line of code deletes everything between the first <sub> and the last </sub> in the html file...

hwnd · Accepted Answer

It is recommended to use a parser when dealing with HTML, but to explain your problem...

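The issue is that .* is greedy: it goes all the way to the end of the string and then backtracks until the closing tag can match. Backtracking from the end, the first place where that works is the last closing tag in the string (the second one in your example), so the match runs from the first opening tag to the last closing tag.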

The simple fix is to follow .* with ? to prevent greediness. .*? still matches any character (except newline), zero or more times, but as few times as possible. In other words, you're telling the regex engine not to be greedy: as soon as it finds a closing tag, it stops.
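To see the difference, here is a small sketch that extracts what each pattern actually matches instead of deleting it. It assumes the subscripts are marked up with <sub> tags as in the question; the variable names greedy and lazy are just for illustration.

```r
txt <- "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2"

# Greedy: .* runs to the end of the string, then backtracks to the LAST </sub>
greedy <- regmatches(txt, regexpr("<sub>(?!rep).*</sub>", txt, perl = TRUE))

# Lazy: .*? grows one character at a time and stops at the FIRST </sub>
lazy <- regmatches(txt, regexpr("<sub>(?!rep).*?</sub>", txt, perl = TRUE))

print(greedy)  # "<sub>i</sub>(10) = 23, p<sub>rep</sub>"
print(lazy)    # "<sub>i</sub>"
```

The greedy match swallows the "rep" subscript too, which is why it disappeared from your output.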

txt <- 't<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2'
gsub('<sub>(?!rep).*?</sub>', '', txt, perl=TRUE)
# [1] "t(10) = 23, p<sub>rep</sub>=.2"
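One caveat, in case it matters for your file: (?!rep) spares every subscript that merely starts with "rep", not only the one that is exactly "rep". The sketch below is a possible stricter variant, not something from the question: the extra "repeat" subscript is a made-up example, and it assumes subscripts are not nested and contain no other tags.

```r
# Hypothetical input: the "repeat" subscript is added here for illustration only
txt <- "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2, r<sub>repeat</sub>=.5"

# Lookahead from the answer: keeps any subscript that starts with "rep"
loose <- gsub("<sub>(?!rep).*?</sub>", "", txt, perl = TRUE)

# Stricter lookahead: keeps only the subscript that is exactly "rep".
# [^<]* cannot run past the next tag, so no lazy quantifier is needed.
strict <- gsub("<sub>(?!rep</sub>)[^<]*</sub>", "", txt, perl = TRUE)

print(loose)   # "t(10) = 23, p<sub>rep</sub>=.2, r<sub>repeat</sub>=.5"
print(strict)  # "t(10) = 23, p<sub>rep</sub>=.2, r=.5"
```

Using the negated character class [^<]* instead of .*? also works across line breaks, since it does not depend on what . matches.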
