Reputation: 4396

Regular expression how to get middle strings

I want to search for string that occurs between a certain string. For example,

\start

\problem{number}
\subproblem{number}

/* strings that I want to get */

\subproblem{number}

/* strings that I want to get */

\problem{number}
\subproblem{number}
       ...
       ...
\end

More specifically, I want to get problem number and subproblem number and strings between which is answer.

I somewhat came up with expression like

'(\\problem{(.*?)}\n)? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'

but it seems like it doesn't work as I expect. What is wrong with this expression?

Upvotes: 0

Answers (3)

abarnert

Reputation: 365597

If the question really is "What is wrong with this expression?", here's the answer:

You're trying to match newlines with a .*?. You need (?s) for that to work.
You have explicit spaces and newlines in the middle of the regex that don't have any corresponding characters in the source text. You need (?x) for that to work.

That may not be all that's wrong with the expression. But just adding (?sx), turning it into a raw string (because I don't trust myself to mix Python quoting and regex quoting properly), and removing the \n gives me this:

r'(?sx)(\\problem{(.*?)}? \\subproblem{(.*?)} (.*?)) (\\problem|\\subproblem|\\end)'

That returns 2 matches instead of 0, and it's probably the smallest change to your regex that works.

However, if the question is "How can I parse this?", rather than "What's wrong with my existing attempt?", I think impl's solution makes more sense (and I also agree with the point about using regex to parse TeX being usually a bad idea)—-or, even better, doing it in two steps as Regexident does.

if using regex to parse TeX is not good idea, then what method would you suggest to parse TeX?

First of all, as a general rule of thumb, if I can't write the regex to solve a problem by myself, I don't want to solve it with a regex, because I'll have a hard time figuring it out a few months from now. Sometimes I break it down into subexpressions, or use (?x) and load it up with comments, but usually I look for another way.

More importantly, if you have a real parser that can consume your language and give you a tree (or whatever's appropriate) that you can walk and search—as with, e.g. etree for XML—then you've got 90% of a solution for every problem you're going to come up with in dealing with that language. A quick&dirty regex (especially one you can't write on your own) only gets you 10% of the way to solving the next problem. And more often than not, if I've got a problem today, I'm going to have more of them in the next few months.

So, what's a good parser for TeX in Python? Honestly, I don't know. I know scipy/matplotlib has something that does it, so I'd probably look there first. Beyond that, check Google, PyPI, and maybe tex.stackexchange.com. The first things that turn up in a search are Texcaller and plasTeX. I have no idea how good they are, or if they're appropriate for your use case, but it shouldn't take long to skim the tutorials and find out.

If it turns out that there's nothing out there, and it comes down to writing something myself with, e.g., pyparsing vs. regexes, then it's a tougher choice. Some languages, it's very easy to define just the subset you care about and leave the rest as giant uninterpreted tokens, in which case a real parser will be just as easy as a regex, so you might as well go that way. Other languages, you have to handle half the syntax before you can do anything useful, so I wouldn't even try. I'd have to put a bit of time into thinking about it and experimenting both ways before deciding which way to go.

Upvotes: 2

Regexident

Reputation: 29552

This one:

(?:\\problem\{(.*?)\}\n)?\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)

returns three matches for me:

Match 1:

group 1: "number"
group 2: "number"
group 3: "/* strings that I want to get */"

Match 2:

group 1: null
group 2: "number"
group 3: "/* strings that I want to get */"

Match 3:

group 1: "number"
group 2: "number"
group 3: "       ...\n       ..."

However I'd rather parse it in two steps.

First find the problem's number (group 1) and content (group 2) using:

\\problem\{(.*?)\}\n(.+?)\\end

Then find the subproblem's numbers (group 1) and contents (group 2) inside that content using:

\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)

Upvotes: 2

impl

Reputation: 803

TeX is pretty complicated and I'm not sure how I feel about parsing it using regular expressions.

That said, your regular expression has two issues:

You're using a space character where you should just consume all whitespace
You need to use a lookahead assertion for your final group so that it doesn't get eaten up (because you need to match it at the beginning of the regex the next time around)

Give this a try:

>>> v
'\\start\n\n\\problem{number}\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\problem{number}\n\\subproblem{number}\n       ...\n       ...\n\\end\n'
>>> re.findall(r'(?:\\problem{(.*?)})?\s*\\subproblem{(.*?)}\s*(.*?)\s*(?=\\problem{|\\subproblem{|\\end)', v, re.DOTALL)
[('number', 'number', '/* strings that I want to get */'), ('', 'number', '/* strings that I want to get */'), ('number', 'number', '...\n       ...')]

Upvotes: 2

Regular expression how to get middle strings

Answers (3)

Related Questions