Justus Erker

Reputation: 87

Python regex doesn't find certain pattern

I am trying to extract LaTeX code from HTML that looks like this:

string = " your answer is wrong! Solution: based on \((\vec{n_E},\vec{g})= 0 \) and \(d(g,E)=0\) beeing ... "

I want to replace each piece of LaTeX code with the output of a function that takes the LaTeX code as its argument. (Because the pattern does not match correctly yet, the function extract simply returns an empty string for now.)

I tried:

latex_end = "\)"
latex_start = "\("    
string = re.sub(r'{}.*?{}'.format(latex_start, latex_end), extract, string)

Result:

your answer is wrong! Solution: based on \= 0 \) and \=0\) beeing ...

Expected:

your answer is wrong! Solution: based on and beeing ...

Any idea why it does not find the pattern? Is there a way to implement it?

Upvotes: 1

Views: 144

Answers (2)

Booboo

Reputation: 44128

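You should define string as a raw string literal; otherwise \v is interpreted as an escape sequence (a vertical tab). In the pattern, both the backslash and the parenthesis have to be escaped, so a literal \( is written as \\\( in a raw pattern.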

import re

string = r" your answer is wrong! Solution: based on \((\vec{n_E},\vec{g})= 0 \) and \(d(g,E)=0\) beeing ... "


string = re.sub(r'\\\(.*?\\\)', '', string)
print(string)

Prints:

 your answer is wrong! Solution: based on  and  beeing ...

If you need to have variables for the start and end:

latex_end = r"\\\)"
latex_start = r"\\\("    
string = re.sub(r'{}.*?{}'.format(latex_start, latex_end), '', string)
print(string)
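
If you would rather keep the delimiters as plain text, re.escape can build the pattern for you, and re.sub accepts a function as the replacement. The following is only a minimal sketch: the body of extract is a hypothetical stand-in (it wraps the LaTeX in angle brackets), and the capture group is an assumption about what the real function wants to receive.

```python
import re

# Literal delimiters exactly as they appear in the text; re.escape turns
# them into patterns that match these characters and nothing else.
latex_start = r"\("
latex_end = r"\)"
pattern = re.compile(re.escape(latex_start) + r"(.*?)" + re.escape(latex_end))


def extract(match):
    """Hypothetical stand-in: re.sub passes a match object, not a string."""
    latex = match.group(1).strip()  # LaTeX source without the delimiters
    return "<{}>".format(latex)


string = r" your answer is wrong! Solution: based on \((\vec{n_E},\vec{g})= 0 \) and \(d(g,E)=0\) beeing ... "
result = pattern.sub(extract, string)
print(result)
```

Note that the replacement function receives a match object, so the LaTeX itself is read with match.group(1).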

Upvotes: 1

Vadim

Reputation: 642

This happens because the backslash is an escape character both in Python string literals and in regular expressions, so each level needs its own escaping. Here are two quick ways to make it work:

import re

extract = lambda a: ""

# Using no raw strings: every literal backslash is doubled
string = " your answer is wrong! Solution: based on \\((\\vec{n_E},\\vec{g})= 0 \\) and \\(d(g,E)=0\\) beeing ... "
latex_bounds = ("\\\\\\(", "\\\\\\)")
print(re.sub('{}.*?{}'.format(*latex_bounds), extract, string))

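# Using raw strings throughout: Python leaves the backslashes alone,
# but the regex engine still needs them escaped
string = r" your answer is wrong! Solution: based on \((\vec{n_E},\vec{g})= 0 \) and \(d(g,E)=0\) beeing ... "
latex_bounds = (r"\\\(", r"\\\)")
print(re.sub(r'{}.*?{}'.format(*latex_bounds), extract, string))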
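
To see why the attempt in the question produces the output it does, here is a small sketch of the two escaping levels. The names sample, broken and fixed are made up for illustration, and the sample text is a shortened version of the string from the question.

```python
import re

# Level 1: Python string literals. Without the r prefix, "\v" is a single
# character (vertical tab), not a backslash followed by "v".
assert len("\v") == 1
assert len(r"\v") == 2

sample = r"based on \(d(g,E)=0\) beeing"

# Level 2: the regex engine. As regex source, \( and \) mean "a literal
# parenthesis", so the backslashes in the text are never matched.
broken = re.compile(r"\(.*?\)")
print(broken.findall(sample))   # ['(d(g,E)']
print(broken.sub("", sample))   # based on \=0\) beeing

# Escaping the backslash as well matches the full delimiter.
fixed = re.compile(r"\\\(.*?\\\)")
print(fixed.sub("", sample))    # based on  beeing
```

The second printed line has the same shape as the result shown in the question: only the part from the first plain parenthesis to the next closing parenthesis is removed, and the backslashes stay behind.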

Upvotes: 1
