Deshwal

Reputation: 4162

Regular Expression to split text based on different patterns (within a single expression)

I have some patterns that detect questions, and I split the text on them. I'm relying on a few assumptions:

  1. Every pattern starts with a \n
  2. Every pattern ends with \s+

These are the patterns I want to detect:

<NUM>.
Q <NUM>.
Q <NUM>
<Q.NUM.>
<NUM>
Question <NUM>
<Example>
Problem <NUM>
Problem:
<Alphabet><Number>.
<EXAMPLE>
Example <NUM>

Someone suggested the regex below:

((Q|Question|Problem:?|Example|EXAMPLE)\.? ?\d+\.? ?|(Question|Problem:?|Example|EXAMPLE) ?)

but it also matches in the middle of a line, which is a problem for me because Q. or Example. 2 can appear mid-string too, and it does not capture <NUM>.
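The behaviour described above can be reproduced with a small check. This is only a sketch: the `sample` string is made up for illustration and is not part of the original test data.

```python
import re

# The suggested pattern, copied as-is: it has no line anchor.
suggested = re.compile(
    r"((Q|Question|Problem:?|Example|EXAMPLE)\.? ?\d+\.? ?"
    r"|(Question|Problem:?|Example|EXAMPLE) ?)"
)

# Hypothetical sample: a "<NUM>." marker at a line start,
# plus "Example. 2" in the middle of another line.
sample = "\n1. First item\nSee Example. 2 in the notes\n"

matches = [m.group(0) for m in suggested.finditer(sample)]
print(matches)  # the mid-line "Example. 2 " is found, the leading "1." is not
```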

The list is in priority order, so the best I could come up with is to build one expression per pattern and loop over them by priority, for example:

import re

QUESTIONS = [
    re.compile(r"\n\d+\."),              # <NUM>.
    re.compile(r"\nQ[. ]\s*\d+\."),      # Q.<NUM>. or Q <NUM>.
    re.compile(r"\nExample[. ]\s*\d+\."),  # Example <NUM>.
]

but this is very inefficient. How can I combine these into one expression?


Here is the test string:

'TEStlabZ\nEDULABZ\nINTERNATIONAL\nLOGARITHMS AND INDICES\n\nQ.1. (A) Convert each of the following to logarithmic form.\n(i) \\( 5^{2}=25 \\)\n(ii) \\( 3^{-3}=\\frac{1}{27} \\)\n(iii) \\( (64)^{\\frac{1}{3}}=4 \\)\n(iv) \\( 6^{0}=1 \\)\n(v) \\( 10^{-2}=0.01 \\) (vi) \\( 4^{-1}=\\frac{1}{4} \\)\nAns. We know that \\( a^{b}=x \\Rightarrow b=\\log _{a} x \\)\n(i) \\( 5^{2}=25 \\quad \\therefore \\log _{5} 25=2 \\)\n(ii) \\( 3^{-3}=\\frac{1}{27} \\therefore \\log _{3}\\left(\\frac{1}{27}\\right)=-3 \\)\n(iii) \\( (64)^{\\frac{1}{3}}=4 \\therefore \\log _{64} 4=\\frac{1}{3} \\)\n(iv) \\( 6^{0}=1 \\quad \\therefore \\log _{6} 1=0 \\)\n(v) \\( 10^{-2}=0.01 \\therefore \\log _{10}(0.01)=-2 \\)\n(vi) \\( 4^{-1}=\\frac{1}{4} \\therefore \\log _{4}\\left(\\frac{1}{4}\\right)=-1 \\)\nQ.1. (B) Convert each of the following to exponential form.\n(i) \\( \\log _{3} 81=4 \\)\n(ii) \\( \\log _{8} 4=\\frac{2}{3} \\)\n(iii) \\( \\log _{2} \\frac{1}{8}=-3 \\)\n(iv) \\( \\log _{10}(0.01)=-2 \\)\n(v) \\( \\log _{5}\\left(\\frac{1}{5}\\right)=-1 \\) (vi) \\( \\log _{a} 1=0 \\)\nAns.\n(i) \\( \\log _{3} 81=4 \\quad \\therefore 3^{4}=81 \\)\n(ii) \\( \\log _{8} 4=\\frac{2}{3} \\quad \\therefore 8^{\\frac{2}{3}}=4 \\)\n(iii) \\( \\log _{2} \\frac{1}{8}=-3 \\quad \\therefore \\quad 2^{-3}=\\frac{1}{8} \\)\n(iv) \\( \\log _{10}(0.01)=-2 \\quad \\therefore \\quad 10^{-2}=0.01 \\)\n(v) \\( \\log _{5}\\left(\\frac{1}{5}\\right)=-1 \\quad \\therefore \\quad 5^{-1}=\\frac{1}{5} \\)\n(vi) \\( \\log _{a} 1=0 \\)\n\\( \\therefore a^{0}=1 \\)\nMath Class IX\n1\nQuestion Bank'

Upvotes: 3

Views: 92

Answers (2)

Wiktor Stribiżew

Reputation: 627469

You can use

(?m)^(?!$)(?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)?(\d+[. ]?)?(?=\s)

See the regex demo.

Details:

  • (?m)^ - start of a line (m allows ^ to match any line start position)
  • (?!$) - no end of line allowed at the same location (i.e. no empty line match allowed)
  • (?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)? - an optional sequence of
    • ((?i:Question|Problem:?|Example)|[A-Z]) - Group 1: Question, Problem, Problem: or Example case insensitively, or an uppercase letter
    • [. ]? - an optional space or .
  • (\d+[. ]?)? - an optional capturing group with ID 2 matching one or more digits and then an optional . or space
  • (?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location.
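As a rough sketch of how this pattern might be used from Python (the language of the question), the snippet below collects the markers and then slices the text at each marker's start. The `sample` string is hypothetical, and empty matches are filtered out because every part of the pattern is optional, so a line that begins with whitespace could otherwise yield a zero-length match.

```python
import re

# Pattern from the answer; (?m) must stay at the very start of the expression.
pattern = re.compile(
    r"(?m)^(?!$)(?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)?(\d+[. ]?)?(?=\s)"
)

# Hypothetical sample, not the original test string.
sample = (
    "INTRO TEXT\n"
    "Q.1. (A) Convert to logarithmic form.\n"
    "See Example. 2 in the middle of a line.\n"
    "Question 2 Solve for x.\n"
    "3. Simplify the expression.\n"
)

# Keep only non-empty matches.
hits = [m for m in pattern.finditer(sample) if m.group(0)]
markers = [m.group(0) for m in hits]

# Slice the text from each marker to the next one (or to the end).
starts = [m.start() for m in hits]
chunks = [sample[a:b] for a, b in zip(starts, starts[1:] + [len(sample)])]

print(markers)
print(len(chunks))
```

Text before the first marker (here the `INTRO TEXT` line) is dropped by this slicing; whether that is wanted depends on the use case.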

Upvotes: 1

dc-ddfe

Reputation: 495

No shame in just doing the dumb solution:

^(\d+\.|Q \d+\.|Q \d+|Q\.\d+\.|\d+|Question \d+|Example( \d+)?|Problem \d+|Problem:|[A-Z]\d\.|EXAMPLE)\s+
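A possible way to use this from Python is sketched below. It assumes `re.MULTILINE` is wanted so that `^` matches at every line start, and it builds the same alternation from a list so the priority order stays visible. The names `ALTERNATIVES` and `sample` are made up for illustration.

```python
import re

# Alternatives in the same order as the expression above.
# Order matters: r"\d+\." must come before r"\d+", and so on.
ALTERNATIVES = [
    r"\d+\.",
    r"Q \d+\.",
    r"Q \d+",
    r"Q\.\d+\.",
    r"\d+",
    r"Question \d+",
    r"Example( \d+)?",
    r"Problem \d+",
    r"Problem:",
    r"[A-Z]\d\.",
    r"EXAMPLE",
]

# re.MULTILINE is an assumption: without it, ^ only matches at the string start.
pattern = re.compile(r"^(" + "|".join(ALTERNATIVES) + r")\s+", re.MULTILINE)

# Hypothetical sample, not the original test string.
sample = (
    "Q.1. (A) Convert to exponential form.\n"
    "The text mentions Q.2. mid-line.\n"
    "12. Evaluate the sum.\n"
    "Problem: Find x.\n"
)

# Group 1 is the marker without the trailing whitespace.
labels = [m.group(1) for m in pattern.finditer(sample)]
print(labels)
```

Because the regex engine tries alternatives left to right, the list order doubles as the priority order the question asks for.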

Upvotes: 0
