Vignesh Veeresh

Reputation: 39

Regular Expression for MCQ type strings

How can I extract multiple-choice questions, together with their options, from a text document? Each question starts with a number and a dot. A question can span multiple lines and may or may not end with a full stop or question mark. I want to build a dictionary that maps each question number to the corresponding question and options. I'm using Python for this.

17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m

Upvotes: 3

Views: 462

Answers (4)

Cary Swoveland

Reputation: 110685

You can use the following regular expression with Python's standard re module to match each question.

r'(?P<number>\d+)\. *\r?\n(?P<question>(?:(?!\([a-z]\)).*\r?\n)+)(?P<options>(?:(?!(?<=\n)\d+\. *\r?\n).*\r?\n)+)'

The question number will be contained in the capture group (named) number, the question itself will be contained in the capture group question and the options will be contained in the capture group options.

The contents of the capture groups can then be read with Python code and processed as desired. For example, one might construct a list of questions, each a dictionary with keys for number, question and options, or a dictionary keyed by question number whose values are dictionaries holding the question and options.
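A minimal sketch of the second layout (a dictionary keyed by question number) might look like the following. The pattern is the one given above; `parse_mcq`, `tidy` and `LABEL` are hypothetical names introduced only for this illustration, and the sketch assumes that any stray text after the options is preceded by a blank line and that the text ends with a newline.

```python
import re

# Pattern copied from the answer above, split over three lines for readability.
MCQ = re.compile(
    r'(?P<number>\d+)\. *\r?\n'
    r'(?P<question>(?:(?!\([a-z]\)).*\r?\n)+)'
    r'(?P<options>(?:(?!(?<=\n)\d+\. *\r?\n).*\r?\n)+)'
)

# Hypothetical helper pattern: an option label such as "(a)" alone on a line.
LABEL = re.compile(r'^\(([a-z])\)[ \t]*\r?\n', re.MULTILINE)


def tidy(chunk):
    """Join the lines of a chunk with single spaces.

    Assumption: anything after the first blank line is stray text, so it is dropped.
    """
    chunk = re.split(r'\r?\n[ \t]*\r?\n', chunk, maxsplit=1)[0]
    return ' '.join(chunk.split())


def parse_mcq(text):
    """Return {number: {'question': str, 'options': {label: str}}}."""
    result = {}
    for m in MCQ.finditer(text):
        # With one capture group, split() returns
        # [text_before_first_label, 'a', body_a, 'b', body_b, ...].
        parts = LABEL.split(m.group('options'))
        options = {label: tidy(body) for label, body in zip(parts[1::2], parts[2::2])}
        result[int(m.group('number'))] = {
            'question': tidy(m.group('question')),
            'options': options,
        }
    return result


sample = """17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
"""

parsed = parse_mcq(sample)
print(parsed[18]['options']['b'])
```

Because the options group runs up to the next question number, it also swallows any text between questions; that is why `tidy` cuts each chunk at the first blank line.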

Start your engine!

Python's regex engine performs the following operations.

(?P<number>\d+)  : match 1+ digits in capture group 'number'
\. *\r?\n        : match '.' 0+ spaces, line terminator 
(?P<question>    : begin capture group 'question'
  (?:            : begin non-capture group
    (?!          : begin negative lookahead
      \([a-z]\)  : match '(', one lowercase letter, ')'
    )            : end negative lookahead
    .*\r?\n      : match 0+ characters, '\r' optionally, '\n'
  )              : end non-capture group
  +              : execute non-capture group 1+ times
)                : end capture group 'question'
(?P<options>     : begin capture group 'options'
  (?:            : begin non-capture group
    (?!          : begin negative lookahead
      (?<=\n)    : positive lookbehind asserts next character is
                   preceded by a '\n'
      \d+        : match 1+ digits
      \. *\r?\n  : match '.' 0+ spaces, line terminator 
    )            : end negative lookahead
    .*\r?\n      : match 0+ characters, '\r' optionally, '\n'
  )              : end non-capture group
  +              : execute non-capture group 1+ times
)                : end capture group 'options'

In two locations I match any character (.). That could of course be replaced with a character class that limits the possibilities, such as [a-zA-Z\d() -–].

Upvotes: 1

theX

Reputation: 1134

REGEX: \d+\.([^(]+)

It matches the question number and the dot that follows it, then captures everything up to the first ( (the start of the answers).

Test the regex here if you're unsure it's that easy.

Python code:

import re # Imports the standard regex module

text_doc = """
17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m
"""

question_getter = re.compile('\\d+\\.([^(]+)')

print(question_getter.findall(text_doc))

EDIT: but since many people are parsing stuff here, I guess I'll parse stuff, too

Regex for getting the possible answers: \([a-zA-Z]+\)\n(.+)

proof

UPDATED PYTHON:

import re # Imports the standard regex module


text_doc = """
17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m
"""

question_getter = re.compile('\\d+\\.([^(]+)')
answer_getter = re.compile('\\([a-zA-Z]+\\)\\n(.+)')


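# This is where the magical parsing happens
# Pair each question with the answers found between it and the next question.
# Note: answer_getter captures only the first line of a multi-line answer.
matches = list(question_getter.finditer(text_doc))
parsed = {}
for i, match in enumerate(matches):
    end = matches[i + 1].start() if i + 1 < len(matches) else len(text_doc)
    parsed[match.group(1).strip()] = answer_getter.findall(text_doc[match.end():end])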

print(parsed)

Upvotes: 1

Ahmed

Reputation: 74

Hamza's answer is good, but it does not handle options that span multiple lines.

Better Solution: (assuming text in question is in data.txt file)

import re

with open('data.txt', 'r', encoding='utf8') as file:
    data = file.read()

# Split the text into one block per question, assuming there are no empty lines inside a question
questions = re.split(r'\n\s*\n', data)
final_questions = []

for question in questions:
    if question and '(a)' in question:  # extra check to make sure that this is a question
        statement = re.findall(r'[^(]+', question)[0].replace('\n', ' ').rstrip()
        option_a = re.findall(r'\(a\)[^(]+', question)[0].replace('\n', ' ').rstrip()
        option_b = re.findall(r'\(b\)[^(]+', question)[0].replace('\n', ' ').rstrip()
        option_c = re.findall(r'\(c\)[^(]+', question)[0].replace('\n', ' ').rstrip()
        option_d = re.findall(r'\(d\)[^(]+', question)[0].replace('\n', ' ').rstrip()
        final_questions.append({
                    'statement': statement.rstrip(),
                    'options': [option_a, option_b, option_c, option_d]
                })

print(final_questions)

Output:

[
   {
      "statement":"17. If you go on increasing the stretching force on a wire in a guitar, its frequency.",
      "options":[
         "(a) increases",
         "(b) decreases",
         "(c) remains unchanged",
         "(d) None of these"
      ]
   },
   {
      "statement":"18. A vibrating body",
      "options":[
         "(a) will always produce sound",
         "(b) may or may not produce sound if the amplitude of vibration is low",
         "(c) will produce sound which depends upon frequency",
         "(d) None of these"
      ]
   },
   {
      "statement":"19. The wavelength of infrasonics in air is of the order of",
      "options":[
         "(a) 100 m",
         "(b) 101 m",
         "(c) 10–1 m",
         "(d) 10–2 m"
      ]
   }
]

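Note: The output above assumes at least one empty line between consecutive questions and no stray text inside a question's block. In the sample as posted, questions 18 and 19 are not separated by an empty line, and "some random text between questions" sits directly above "18.", so the input has to be cleaned up first.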
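If adding blank lines to the input is not practical, one possible variation is to split the text just before each line that holds only a question number. The sketch below is not part of the original answer: `parse_blocks` is a hypothetical name, the zero-width split relies on Python 3.7 or later, and it assumes that stray text after a question's options is preceded by a blank line. The sample is shortened to two options per question.

```python
import re


def parse_blocks(data):
    """Split on question-number lines instead of blank lines."""
    # Zero-width split before each line that contains only "<digits>."
    blocks = re.split(r'(?m)^(?=\d+\.[ \t\r]*$)', data)
    parsed = []
    for block in blocks:
        # Skip leading text and anything that does not look like a question.
        if not re.match(r'\d+\.', block) or '(a)' not in block:
            continue
        # Assumption: whatever follows the first blank line is stray text.
        block = re.split(r'\n[ \t\r]*\n', block, maxsplit=1)[0]
        # Split before each option label that starts a line:
        # the first piece is the statement, the rest are the options.
        pieces = re.split(r'(?m)^(?=\([a-z]\))', block)
        clean = [' '.join(piece.split()) for piece in pieces]
        parsed.append({'statement': clean[0], 'options': clean[1:]})
    return parsed


sample = """some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
"""

blocks = parse_blocks(sample)
print(blocks)
```

Since the options are split only at labels that start a line, this variation also does not depend on a fixed number of options per question.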

Upvotes: 1

Hamza Rashid

Reputation: 1387

Solution

Suppose your questions are stored in a file named questions.txt.

17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
19.
The wavelength of infrasonics in air is of the order of
(a)
100 m
(b)
101 m
(c)
10–1 m
(d)
10–2 m

Python code to parse questions.txt per requirements.

import re

filename = 'questions.txt'
questions = []

with open(file=filename, mode='r', encoding='utf8') as f:
    lines = f.readlines()

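    is_label = False  # True when the line is a label: 17. | (a) | (b) | (c) | (d)
    # Initialise every state flag so that text before the first label cannot raise a NameError
    is_statement = is_option_a = is_option_b = is_option_c = is_option_d = False
    statement = option_a = option_b = option_c = option_d = ''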

    for line in lines:
        if re.match(r'^\d+\.$', line):
            is_statement = is_label = True
            is_option_a = is_option_b = is_option_c = is_option_d = False
        elif re.match(r'^\(a\)$', line):
            is_option_a = is_label = True
            is_statement = is_option_b = is_option_c = is_option_d = False
        elif re.match(r'^\(b\)$', line):
            is_option_b = is_label = True
            is_statement = is_option_a = is_option_c = is_option_d = False
        elif re.match(r'^\(c\)$', line):
            is_option_c = is_label = True
            is_statement = is_option_a = is_option_b = is_option_d = False
        elif re.match(r'^\(d\)$', line):
            is_option_d = is_label = True
            is_statement = is_option_a = is_option_b = is_option_c = False
        else:
            is_label = False

        if is_label:
            continue

        if is_statement:
            statement += line
        elif is_option_a:
            option_a = line.rstrip()
        elif is_option_b:
            option_b = line.rstrip()
        elif is_option_c:
            option_c = line.rstrip()
        elif is_option_d:
            option_d = line.rstrip()

            if statement:
                questions.append({
                    'statement': statement.rstrip(),
                    'options': [option_a, option_b, option_c, option_d]
                })
                statement = option_a = option_b = option_c = option_d = ''

print(questions)

Output

[
  {
    "statement": "If you go on increasing the stretching force on a wire in a\nguitar, its frequency.",
    "options": [
      "increases",
      "decreases",
      "remains unchanged",
      "None of these"
    ]
  },
  {
    "statement": "A vibrating body",
    "options": [
      "will always produce sound",
      "vibration is low",
      "will produce sound which depends upon frequency",
      "None of these"
    ]
  },
  {
    "statement": "The wavelength of infrasonics in air is of the order of",
    "options": [
      "100 m",
      "101 m",
      "10–1 m",
      "10–2 m"
    ]
  }
]

Side note

  • Text such as "some random text between questions" is ignored.
  • A multi-line question statement is kept as is (the newline character is intentionally not removed). You can replace \n with a space if you prefer.
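In the output above, option (b) of question 18 contains only "vibration is low", because each new option line overwrites the previous one. A minimal sketch of one way to accumulate the lines instead is shown below. It is not the code from this answer: `parse_lines`, `NUMBER` and `LABEL` are hypothetical names, and the sketch assumes that a blank line ends the current question, so text after it is ignored until the next question number.

```python
import re

NUMBER = re.compile(r'^(\d+)\.$')    # a line such as "17."
LABEL = re.compile(r'^\(([a-d])\)$')  # a line such as "(a)"


def parse_lines(lines):
    """Line-by-line parser that keeps every line of a multi-line option."""
    questions = []
    current = None  # question currently being filled
    target = None   # 'statement', an option label, or None (ignore the line)
    for raw in lines:
        line = raw.strip()
        number = NUMBER.match(line)
        label = LABEL.match(line)
        if number:
            current = {'number': int(number.group(1)), 'statement': [], 'options': {}}
            questions.append(current)
            target = 'statement'
        elif label and current is not None:
            target = label.group(1)
            current['options'][target] = []
        elif not line:
            # Assumption: a blank line ends the question; skip text until the next label.
            target = None
        elif target == 'statement':
            current['statement'].append(line)
        elif target is not None:
            current['options'][target].append(line)
    # Join the collected lines with single spaces.
    return [
        {
            'number': q['number'],
            'statement': ' '.join(q['statement']),
            'options': [' '.join(text) for text in q['options'].values()],
        }
        for q in questions
    ]


sample = """17.
If you go on increasing the stretching force on a wire in a
guitar, its frequency.
(a)
increases
(b)
decreases
(c)
remains unchanged
(d)
None of these

some random text between questions
18.
A vibrating body
(a)
will always produce sound
(b)
may or may not produce sound if the amplitude of
vibration is low
(c)
will produce sound which depends upon frequency
(d)
None of these
"""

result = parse_lines(sample.splitlines())
print(result[1]['options'][1])
```

The same function can be fed the list returned by `f.readlines()`, since each line is stripped before matching. Stray text that directly follows an option without a blank line would still be appended to that option.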

Upvotes: 1
