Faisal

Reputation: 151

Regex for specific Pattern

I have this text:

What can cause skidding on bends?

All of the following can:
SA Faulty shock-absorbers
SA Insufficient or uneven tyre pressure
[| Load is too small

What can cause a dangerous situation?

SA Brakes which engage heavily on one side
SA Too much steering-wheel play

[| Disturbed reception of traffic information on the radio

It starts raining. Why must you immediately increase the safe distance?

What is correct

[| Because the brakes react more quickly

SA Because a greasy film may form which increases the braking distance

SA Because a second greasy film may form which increases the braking distance


What is the text about?

Above are multiple-choice questions, each with several options. The question stem almost always ends with '?', but sometimes there is additional text before the options start. Every option starts with either 'SA' or '[|'. Options starting with 'SA' are correct; options starting with '[|' or '[]' are wrong.

What I want to do

I want to split out each question and all of its options and save them in a Python dictionary/list, ideally as key-value pairs such as {'ques': 'blalal', 'opt1': 'this is option one', 'opt2': 'this is option two'} and so on.

What I have tried?

rx = r'.*\?$\s*\w*(?:SA|\[\|)'

This is the regex101 link.

Upvotes: 0

Views: 48

Answers (2)

Ryszard Czech

Reputation: 18611

Assuming you have three options at all times:

import re
# `string` holds the question text; wrong options may start with "[|", "[]" or "[ ]"
p = r'(?m)^(?P<ques>\w[^?]*\?)[\s\S]*?^(?P<opt1>(?:SA|\[(?:\||\s?\])).*)\s+^(?P<opt2>(?:SA|\[(?:\||\s?\])).*)\s+^(?P<opt3>(?:SA|\[(?:\||\s?\])).*)'
dt = [x.groupdict() for x in re.finditer(p, string)]

See regex proof and Python proof.

Results:

[{'ques': 'What can cause skidding on bends?', 'opt1': 'SA Faulty shock-absorbers', 'opt2': 'SA Insufficient or uneven tyre pressure', 'opt3': '[| Load is too small'}, {'ques': 'What can cause a dangerous situation?', 'opt1': 'SA Brakes which engage heavily on one side', 'opt2': 'SA Too much steering-wheel play', 'opt3': '[| Disturbed reception of traffic information on the radio'}, {'ques': 'It starts raining. Why must you immediately increase the safe distance?', 'opt1': '[| Because the brakes react more quickly', 'opt2': 'SA Because a greasy film may form which increases the braking distance', 'opt3': 'SA Because a second greasy film may form which increases the braking distance'}]
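
If the number of options is not always three, one possible adjustment is a two-pass approach: first match each question together with its whole block of option lines, then pull the individual options out of that block. This is only a minimal sketch, assuming every option sits on a single unindented line and starts with SA, [| or []; the names MARK, BLOCK, OPTION and parse_questions are made up for illustration:

```python
import re

# Hypothetical names; assumes one option per line, starting with "SA", "[|" or "[]".
MARK = r'(?:SA\b|\[\||\[\])'

# A question stem, any extra text, then one or more consecutive option lines.
BLOCK = re.compile(
    r'^(?P<ques>\w[^?]*\?)'                 # question text up to the first "?"
    r'[\s\S]*?'                             # optional text before the options
    r'(?P<opts>(?:^' + MARK + r'.*\s*)+)',  # option lines, blank lines allowed between
    re.M,
)
OPTION = re.compile(r'^' + MARK + r'.*', re.M)


def parse_questions(text):
    """Return one dict per question: {'ques': ..., 'opt1': ..., 'opt2': ...}."""
    result = []
    for m in BLOCK.finditer(text):
        entry = {'ques': m.group('ques')}
        for i, opt in enumerate(OPTION.findall(m.group('opts')), start=1):
            entry[f'opt{i}'] = opt.strip()
        result.append(entry)
    return result
```

The keys opt1, opt2, ... are numbered per question, so dicts for different questions may have different lengths.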

Upvotes: 1

urban

Reputation: 5682

This is one of those cases where I would recommend against using regex, since the pattern can get very complex very fast. My solution would be the following parser:

def parse(fname = "/tmp/data.txt"):
    questions = []
    with open(fname) as f:
        for line in f:
            lstrip = line.strip()

            # Skip empty lines
            if not lstrip:
                continue

            # Check whether this line is an option
            is_option = lstrip.startswith(("[]", "[|", "SA"))

            if not is_option:
                # Here we know that this line is not empty and is not
                # an option... We have two options:
                # 1. This is continuation of the last question
                # 2. This is a new question

                if not questions or questions[-1]["options"]:
                    # No question yet, or the last one already has options: this is a new question!
                    questions.append({
                        "ques": [lstrip],
                        "options": []
                    })
                else:
                    # We are still parsing the questions part. Add a new line
                    questions[-1]["ques"].append(lstrip)

                # We are done with the question part, move on
                continue

            # We are only here if we are parsing options!
            is_correct = lstrip.startswith("SA")

            # We _must_ have at least one question
            assert questions

            # Add the option
            questions[-1]["options"].append({
                "option": lstrip,
                "correct": is_correct,
                "number": len(questions[-1]["options"]) + 1,
            })

    # End of with
    return questions

An example usage of the above and its output:

# main
data = parse()
# json just for pretty printing
import json
print(json.dumps(data, indent=4))

---

$ python3 ~/tmp/so.py
[
    {
        "ques": [
            "What can cause skidding on bends?",
            "All of the following can:"
        ],
        "options": [
            {
                "option": "SA Faulty shock-absorbers",
                "correct": true,
                "number": 1
            },
            {
                "option": "SA Insufficient or uneven tyre pressure",
                "correct": true,
                "number": 2
            },
            {
                "option": "[| Load is too small",
                "correct": false,
                "number": 3
            }
        ]
    },
    {
        "ques": [
            "What can cause a dangerous situation?"
        ],
        "options": [
            {
                "option": "SA Brakes which engage heavily on one side",
                "correct": true,
                "number": 1
            },
            {
                "option": "SA Too much steering-wheel play",
                "correct": true,
                "number": 2
            },
            {
                "option": "[| Disturbed reception of traffic information on the radio",
                "correct": false,
                "number": 3
            }
        ]
    },
    {
        "ques": [
            "It starts raining. Why must you immediately increase the safe distance?",
            "What is correct"
        ],
        "options": [
            {
                "option": "[| Because the brakes react more quickly",
                "correct": false,
                "number": 1
            },
            {
                "option": "SA Because a greasy film may form which increases the braking distance",
                "correct": true,
                "number": 2
            },
            {
                "option": "SA Because a second greasy film may form which increases the braking distance",
                "correct": true,
                "number": 3
            }
        ]
    }
]
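
The question asked for flat key/value pairs rather than this nested structure. One way to get there is sketched below, under the assumption that the input has exactly the shape printed above; flatten is a made-up helper name, not part of the parser:

```python
def flatten(questions):
    """Convert the nested parser output into flat dicts.

    Assumes each item looks like
    {"ques": [str, ...], "options": [{"option": str, "number": int, ...}, ...]}.
    """
    flat = []
    for q in questions:
        # Join a multi-line question into a single string
        entry = {"ques": " ".join(q["ques"])}
        for opt in q["options"]:
            # Keys become opt1, opt2, ... following the stored option number
            entry[f"opt{opt['number']}"] = opt["option"]
        flat.append(entry)
    return flat
```

Calling flatten(parse()) would then give one flat dict per question. The "correct" flag is dropped in this form, although the SA / [| prefix is still part of each option string.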

There are a few advantages to using a custom parser instead of regex:

  • A lot more readable (think about what you would like to read when you come back to this project in 6 months :) )
  • More control over which lines you keep and how you trim them
  • Easier to deal with bad input data (debug it using logging)

That said, data is rarely perfect, and in most cases a few workarounds might be required to get the desired output. For example, in your original data, the line "All of the following can:" is not an option, since it does not start with any of the option markers. However, it does not look like part of the question to me either! You will have to deal with such cases based on your dataset (and doing so in regex will be a lot harder). In this particular case you can:

  • Treat only lines that end with ? as part of the question (problematic for the 3rd question)
  • Treat lines starting with "None" or "All" as options
  • etc.

The exact solution depends on your data quality and edge cases, but the code above should be easy to adjust in most cases.
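
As an illustration of the first adjustment, here is a rough variant in which only lines ending with ? count as question text and everything unclassified is logged and skipped. It is a sketch, not a drop-in replacement: parse_lines and OPTION_MARKS are hypothetical names, and the function takes any iterable of lines instead of a file name.

```python
import logging

log = logging.getLogger(__name__)

# Assumed option markers, taken from the question
OPTION_MARKS = ("SA", "[|", "[]")


def parse_lines(lines):
    """Parser variant: only lines ending with '?' are treated as question text."""
    questions = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip empty lines

        if line.startswith(OPTION_MARKS):
            if not questions:
                # Bad input: log it instead of failing on an assert
                log.warning("line %d: option before any question: %r", lineno, line)
                continue
            questions[-1]["options"].append(
                {"option": line, "correct": line.startswith("SA")}
            )
        elif line.endswith("?"):
            if questions and not questions[-1]["options"]:
                # Still inside a multi-line question
                questions[-1]["ques"].append(line)
            else:
                questions.append({"ques": [line], "options": []})
        else:
            # Neither an option nor a question, e.g. "All of the following can:"
            log.warning("line %d: skipped unclassified line: %r", lineno, line)
    return questions
```

With this rule, "All of the following can:" is dropped, but so is "What is correct" in the 3rd question, which is exactly the trade-off mentioned in the list above. The log output shows which lines were skipped, so the rules can be tuned against the real data.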

Upvotes: 1
