Michele

Reputation: 33

regex: match all subscripts in an html file except a specific one

I want to remove all subscripts from a piece of HTML, except the subscript "rep".

For instance, the string "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2" should become: "t(10) = 23, p<sub>rep</sub>=.2"

I was trying things like:

txt <- "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2"
gsub(pattern="<sub>(?!rep).*</sub>",replacement="",txt,perl=TRUE)

The problem is that this line deletes everything between the first <sub> and the last </sub> in the HTML file.

Upvotes: 0

Views: 139

Answers (2)

hwnd

Reputation: 70732

It is recommended to use a parser when dealing with HTML, but to explain your problem:

The issue is that .* is greedy. It first consumes the rest of the string, then backtracks only as far as needed for the closing tag to match. The first closing tag it meets while backtracking is the last </sub> in the string, so the match runs from the first <sub> to the last </sub>.

The simple fix is to follow .* with ? to make the quantifier lazy. .*? still matches any character except a newline, zero or more times, but as few times as possible, so the match stops at the first closing tag after the opening <sub> instead of the last one.

txt <- 't<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2'
gsub('<sub>(?!rep).*?</sub>', '', txt, perl=TRUE)
# [1] "t(10) = 23, p<sub>rep</sub>=.2"
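
To see the difference on the sample string, the sketch below runs the greedy and lazy patterns side by side. The last pattern is an assumption that goes beyond the fix above: it tightens the lookahead to (?!rep</sub>) and uses [^<]* in place of .*?, so that a subscript such as "report" (a made-up example, not from the question) is still removed and a match cannot run across a tag boundary.

```r
txt <- 't<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2'

# Greedy: .* runs to the last </sub>, so both subscripts are swallowed.
greedy <- gsub('<sub>(?!rep).*</sub>', '', txt, perl = TRUE)

# Lazy: .*? stops at the first </sub> after each opening tag.
lazy <- gsub('<sub>(?!rep).*?</sub>', '', txt, perl = TRUE)

# Assumed stricter variant: keep only subscripts that are exactly "rep".
# txt2 is a hypothetical input used purely for illustration.
txt2 <- 'a<sub>report</sub> b<sub>rep</sub> c<sub>k</sub>'
strict <- gsub('<sub>(?!rep</sub>)[^<]*</sub>', '', txt2, perl = TRUE)

print(greedy)  # "t=.2"
print(lazy)    # "t(10) = 23, p<sub>rep</sub>=.2"
print(strict)  # "a b<sub>rep</sub> c"
```

With the plain (?!rep) lookahead, "report" in txt2 would be kept as well, because the lookahead only checks the first three characters after <sub>.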

Upvotes: 1

jdharrison

Reputation: 30425

Use the XML package to parse the HTML. You can select the nodes you want to remove with an XPath expression and pass them to removeNodes:

library(XML)
xData <- htmlParse("t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2")
remNodes <- xData['//sub[not(contains(., "rep"))]']
removeNodes(remNodes)
xData
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body>t(10) = 23, p<sub>rep</sub>=.2</body></html>
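
One caveat: contains(., "rep") spares any subscript whose text merely contains "rep", not only the one that is exactly "rep". If an exact match is wanted, a minimal sketch along the same lines could compare the trimmed node text instead. The names doc, to_remove and remaining are illustrative, and passing asText = TRUE is an assumption made to state explicitly that the input is a string rather than a file name.

```r
library(XML)

txt <- 't<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2'
doc <- htmlParse(txt, asText = TRUE)

# Select every <sub> whose whitespace-trimmed text is not exactly "rep".
to_remove <- getNodeSet(doc, '//sub[not(normalize-space(.) = "rep")]')
removeNodes(to_remove)

# Text of the subscripts that are still in the document.
remaining <- xpathSApply(doc, '//sub', xmlValue)
print(remaining)
```

After the removal, only the "rep" subscript should be left in doc.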

Upvotes: 1
