henry

Reputation: 965

Regex: replace a string that comes before or after two different strings

I have this string (html):

html = 'x<sub>i</sub> - y<sub>i)<sub>2</sub>' 

I would like to convert this html string to latex in a robust way. Let me explain:

  1. <sub>SOMETHING</sub> -> converted to _{SOMETHING}

I already know how to do that:

latex = re.sub(r'<sub>(.*?)</sub>',r'_{\1} ', html)

  2. Sometimes the opening <sub> tag or the closing </sub> tag is missing, as in the example string. In that case, the output should still be correct.

My idea was: after running step 1, replace whatever follows a leftover <sub>, and whatever precedes a leftover </sub>, with _{SOMETHING}:

text = re.sub(r'<sub>(.*?)</sub>',r'_{\1} ', html)
print(text)
# if missing part:
text = re.sub(r'<sub>(.*?)',r'_{\1} ', text)
print(text)
latex  = re.sub(r'(.*?)</sub>',r'_{\1} ', text)

… but I get:

x_{i}  - y_{i)<sub>2} 
x_{i}  - y_{i)_{} 2} 
x_{i}  - y_{i)_{} 2} 

What I would like to get:

x_{i}  - y_{i})_{2}

Upvotes: 1

Views: 70

Answers (2)

Wiktor Stribiżew

Reputation: 626896

Assuming your texts are segmented into parts, so that the matching <sub> / </sub> tags may sit in adjoining segments, it should suffice to replace each tag separately, one by one, with no guesswork needed.

Just use

text = text.replace('<sub>', '_{').replace('</sub>', '}')

to replace each <sub> with _{ and </sub> with } in any context.
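As a minimal sketch, here is that replacement applied to the string from the question. The function name `sub_tags_to_latex` is just an illustrative name, not something from the question. Note that each tag is translated independently, so a `<sub>` with no closing tag leaves an unclosed `{` in the result:

```python
def sub_tags_to_latex(text):
    """Replace every <sub> with _{ and every </sub> with }, independently."""
    return text.replace('<sub>', '_{').replace('</sub>', '}')


html = 'x<sub>i</sub> - y<sub>i)<sub>2</sub>'
latex = sub_tags_to_latex(html)
print(latex)  # x_{i} - y_{i)_{2}

# The second <sub> in the input has no closing tag, so one brace stays open.
print(latex.count('{'), latex.count('}'))  # 3 2
```

If the missing tag lives in an adjoining segment, as assumed above, the braces balance again once the segments are joined.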

Upvotes: 2

wjandrea

Reputation: 32987

You need to use greedy quantifiers (i.e. without the ?) for the unmatched tags; otherwise the lazy group (.*?) always matches zero characters, because nothing after it forces it to expand.

>>> text = '1<sub>2'
>>> re.sub(r'<sub>(.*)', r'_{\1} ', text)
'1_{2} '
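
For contrast, a small sketch that runs the lazy and the greedy version on the same input (the variable names are only for illustration):

```python
import re

text = '1<sub>2'

# Lazy: (.*?) is satisfied with zero characters, so the group is empty.
lazy = re.sub(r'<sub>(.*?)', r'_{\1} ', text)
# Greedy: (.*) consumes the rest of the line.
greedy = re.sub(r'<sub>(.*)', r'_{\1} ', text)

print(repr(lazy))    # '1_{} 2'
print(repr(greedy))  # '1_{2} '
```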

BTW, while figuring this out, I noticed you can combine the last two regexes like this:

re.sub(r'<sub>(.*)|(.*)</sub>', r'_{\1\2} ', text)
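
A quick sketch of how the combined pattern behaves on each kind of unmatched tag. The helper name `fix_unmatched` is made up for this example. It relies on unmatched groups in the replacement being substituted as empty strings, which is the behaviour in Python 3.5 and later:

```python
import re

# One alternative per kind of leftover tag; only one group matches at a time.
PATTERN = re.compile(r'<sub>(.*)|(.*)</sub>')


def fix_unmatched(text):
    # The group that did not take part in the match is replaced by ''.
    return PATTERN.sub(r'_{\1\2} ', text)


print(repr(fix_unmatched('1<sub>2')))   # '1_{2} '
print(repr(fix_unmatched('1</sub>2')))  # '_{1} 2'
```

Since both groups are greedy, everything after a leftover <sub>, or everything before a leftover </sub>, ends up inside the braces, so this is best run only after the properly paired tags have already been replaced.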

Upvotes: 1
