Reputation: 151
I have this text:
What can cause skidding on bends?
All of the following can:
SA Faulty shock-absorbers
SA Insufficient or uneven tyre pressure
[| Load is too small
What can cause a dangerous situation?
SA Brakes which engage heavily on one side
SA Too much steering-wheel play
[| Disturbed reception of traffic information on the radio
It starts raining. Why must you immediately increase the safe distance?
What is correct
[| Because the brakes react more quickly
SA Because a greasy film may form which increases the braking distance
SA Because a second greasy film may form which increases the braking distance
What is the text about?
Above are multiple-choice questions, each with several options. The question stem almost always ends with '?', but sometimes there is additional text before the options start. Every option starts with either 'SA' or '[|': options starting with 'SA' are correct, and options starting with '[|' or '[]' are wrong.
What I want to do
I want to split out each question together with all of its options and save them in a Python dictionary or list, ideally as key-value pairs:
{'ques': 'blalal', 'opt1': 'this is option one', 'opt2': 'this is option two'
} and so on
What have I tried?
rx = r'.*\?$\s*\w*(?:SA|\[\|)'
This is the Regex101 link
Upvotes: 0
Views: 48
Reputation: 18611
Assuming you have three options at all times:
import re
p = r'(?m)^(?P<ques>\w[^?]*\?)[\s\S]*?^(?P<opt1>(?:SA|\[(?:\||\s])).*)\s+^(?P<opt2>(?:SA|\[(?:\||\s])).*)\s+^(?P<opt3>(?:SA|\[(?:\||\s])).*)'
dt = [x.groupdict() for x in re.finditer(p, string)]  # string holds the text to parse
See regex proof and Python proof.
Results:
[{'ques': 'What can cause skidding on bends?', 'opt1': 'SA Faulty shock-absorbers', 'opt2': 'SA Insufficient or uneven tyre pressure', 'opt3': '[| Load is too small'}, {'ques': 'What can cause a dangerous situation?', 'opt1': 'SA Brakes which engage heavily on one side', 'opt2': 'SA Too much steering-wheel play', 'opt3': '[| Disturbed reception of traffic information on the radio'}, {'ques': 'It starts raining. Why must you immediately increase the safe distance?', 'opt1': '[| Because the brakes react more quickly', 'opt2': 'SA Because a greasy film may form which increases the braking distance', 'opt3': 'SA Because a second greasy film may form which increases the braking distance'}]
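If the number of options varies between questions, a fixed three-group pattern no longer fits. The following is only a sketch of one possible variation, not part of the pattern above: it captures the whole run of consecutive option lines in one group and splits it afterwards. The names `OPTION`, `block_re`, and `split_questions` are made up for illustration, and the shortened sample text is an assumption.

```python
import re

# Hypothetical, shortened sample; in practice `text` is the full input.
text = """What can cause skidding on bends?
All of the following can:
SA Faulty shock-absorbers
SA Insufficient or uneven tyre pressure
[| Load is too small
What can cause a dangerous situation?
SA Brakes which engage heavily on one side
[| Disturbed reception of traffic information on the radio
"""

# An option line starts with 'SA', '[|' or '[]'.
OPTION = r'(?:SA|\[[|\]])'

# One block = a stem ending in '?', any extra lines (skipped lazily),
# then one or more consecutive option lines captured as a single group.
block_re = re.compile(
    rf'(?m)^(?P<ques>\w[^?]*\?)[\s\S]*?(?P<opts>(?:^{OPTION}.*\n?)+)'
)

def split_questions(text):
    result = []
    for m in block_re.finditer(text):
        entry = {'ques': m.group('ques')}
        # Number the options opt1, opt2, ... in order of appearance.
        for i, opt in enumerate(m.group('opts').splitlines(), start=1):
            entry[f'opt{i}'] = opt
        result.append(entry)
    return result

for entry in split_questions(text):
    print(entry)
```

Like the original pattern, this drops any text between the stem and the first option, and it would misread a non-option line that happens to begin with `SA`.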
Upvotes: 1
Reputation: 5682
This is one of those cases where I would recommend against using regex, since it can get very complex very fast. My solution would be the following parser:
def parse(fname = "/tmp/data.txt"):
    questions = []
    with open(fname) as f:
        for line in f:
            lstrip = line.strip()
            # Skip empty lines
            if not lstrip:
                continue
            # Check for Questions
            is_option = (
                lstrip.startswith("[]")
                or lstrip.startswith("[|")
                or lstrip.startswith("SA")
            )
            if not is_option:
                # Here we know that this line is not empty and is not
                # an option... We have two options:
                # 1. This is continuation of the last question
                # 2. This is a new question
                if not questions or questions[-1]["options"]:
                    # Last questions has options, this is a new question!
                    questions.append({
                        "ques": [lstrip],
                        "options": []
                    })
                else:
                    # We are still parsing the questions part. Add a new line
                    questions[-1]["ques"].append(lstrip)
                # We are done with the question part, move on
                continue
            # We are only here if we are parsing options!
            is_correct = lstrip.startswith("SA")
            # We _must_ have at least one question
            assert questions
            # Add the option
            questions[-1]["options"].append({
                "option": lstrip,
                "correct": is_correct,
                "number": len(questions[-1]["options"]) + 1,
            })
    # End of with
    return questions
An example usage of the above and its output:
# main
data = parse()
# json just for pretty printing
import json
print(json.dumps(data, indent=4))
---
$ python3 ~/tmp/so.py
[
    {
        "ques": [
            "What can cause skidding on bends?",
            "All of the following can:"
        ],
        "options": [
            {
                "option": "SA Faulty shock-absorbers",
                "correct": true,
                "number": 1
            },
            {
                "option": "SA Insufficient or uneven tyre pressure",
                "correct": true,
                "number": 2
            },
            {
                "option": "[| Load is too small",
                "correct": false,
                "number": 3
            }
        ]
    },
    {
        "ques": [
            "What can cause a dangerous situation?"
        ],
        "options": [
            {
                "option": "SA Brakes which engage heavily on one side",
                "correct": true,
                "number": 1
            },
            {
                "option": "SA Too much steering-wheel play",
                "correct": true,
                "number": 2
            },
            {
                "option": "[| Disturbed reception of traffic information on the radio",
                "correct": false,
                "number": 3
            }
        ]
    },
    {
        "ques": [
            "It starts raining. Why must you immediately increase the safe distance?",
            "What is correct"
        ],
        "options": [
            {
                "option": "[| Because the brakes react more quickly",
                "correct": false,
                "number": 1
            },
            {
                "option": "SA Because a greasy film may form which increases the braking distance",
                "correct": true,
                "number": 2
            },
            {
                "option": "SA Because a second greasy film may form which increases the braking distance",
                "correct": true,
                "number": 3
            }
        ]
    }
]
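If you prefer the flat `{'ques': ..., 'opt1': ..., 'opt2': ...}` shape from the question, the nested structure can be converted afterwards. This is a minimal sketch, assuming the entry layout shown in the output above; `flatten` is a hypothetical helper name, and joining the question lines with a space is just one possible choice.

```python
def flatten(question):
    """Turn one parsed question into a flat {'ques', 'opt1', ...} dict."""
    # Join multi-line question text into a single string (an assumption;
    # you may prefer to keep only the first line).
    flat = {"ques": " ".join(question["ques"])}
    for opt in question["options"]:
        flat[f"opt{opt['number']}"] = opt["option"]
    return flat

# One entry copied from the output above
sample = {
    "ques": ["What can cause a dangerous situation?"],
    "options": [
        {"option": "SA Brakes which engage heavily on one side",
         "correct": True, "number": 1},
        {"option": "SA Too much steering-wheel play",
         "correct": True, "number": 2},
        {"option": "[| Disturbed reception of traffic information on the radio",
         "correct": False, "number": 3},
    ],
}

print(flatten(sample))
```

Applied to the whole result, that would be `[flatten(q) for q in parse()]`. Note that the flat form loses the separate `correct` flag, although the `SA` prefix is still part of each option string.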
There are a few advantages to using a custom parser instead of regex:
That said, data is rarely perfect, and in most cases a few workarounds may be required to get the desired output. For example, in your original data, "All of the following can:" does not look like an option, since it does not start with any of the option prefixes. However, it does not look like part of the question to me either! You will have to handle such cases based on your dataset (and doing so in regex would be a lot harder). In this particular case you can:
?
(problematic in 3rd question)
The exact solution depends on your data quality and cases, but the code above should be easy to adjust in most cases.
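As one illustration of such an adjustment (a sketch only, and not necessarily the right rule for your data), the collected question lines can be split into the stem, which runs up to the last line ending in '?', and whatever extra text follows it. `split_stem` is a hypothetical helper that works on the `"ques"` lists produced by the parser.

```python
def split_stem(ques_lines):
    """Split question lines into (stem, extra).

    The stem runs up to and including the last line ending in '?'.
    If no line ends in '?', everything is treated as the stem.
    """
    last_q = max(
        (i for i, line in enumerate(ques_lines) if line.rstrip().endswith("?")),
        default=len(ques_lines) - 1,
    )
    return ques_lines[: last_q + 1], ques_lines[last_q + 1 :]

stem, extra = split_stem([
    "What can cause skidding on bends?",
    "All of the following can:",
])
print(stem)   # lines kept as the question
print(extra)  # lines after the '?', to keep or drop as you see fit
```

Whether the extra lines should be discarded, stored under a separate key, or appended to the question is a decision that depends on the dataset.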
Upvotes: 1