Reputation: 63

Parsing transcripts with regular expression

I have a text whose format resembles this sample:

PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.

LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.

EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.

PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus.

I also have a regular expression to parse the transcript into dialogs:

[A-Z]+([:]|[ ]{1}[[][A-Z]*[]])

I am trying to capture all the speakers, so that the regular expression matches

"PAUL:", 
"LEONARD [some context]:"

However, my expression does not capture all of the speakers. For example, it misses:

EVIL NINJA [on the roof]:

How can I capture the above as well? Is regex even the right way to go for this?

Edit: All the speaker names are in caps and end with a colon. All of the transcripts I'm dealing with are in this format.

Upvotes: 0

Answers (3)

Harry

Reputation: 318

regex

"^([A-Z\s]+)(?:\[(?:[\w ]+)\])?:(.*?)$"

In the name group, A-Z can be changed to \w to also allow lowercase letters, digits and underscores.
To capture the context, change (?:[\w ]+) to ([\w ]+).
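
The second note can be sketched as follows. This is a minimal illustration, not part of the original answer: it uses the pattern above with the context group made capturing, and the shortened `sample` text and the `rows` name are assumptions made for the example. Because `\s` lets the name group absorb surrounding whitespace (including the newline before a speaker), the captured name is stripped.

```python
import re

# Pattern from this answer with (?:[\w ]+) changed to ([\w ]+),
# so the bracketed context becomes group 2 and the dialog group 3.
PATTERN = re.compile(r"^([A-Z\s]+)(?:\[([\w ]+)\])?:(.*?)$", re.MULTILINE)

# Shortened stand-in for the transcript in the question (illustrative text).
sample = (
    "PAUL: Lorem ipsum.\n\n"
    "EVIL NINJA [on the roof]: In enim justo.\n\n"
    "PAUL [SCREAMING]: Aliquam lorem ante.\n"
)

# strip() removes the whitespace that [A-Z\s]+ picks up around the name
# and the leading space of the dialog text.
rows = [
    (m.group(1).strip(), m.group(2), m.group(3).strip())
    for m in PATTERN.finditer(sample)
]

for name, context, dialog in rows:
    print(name, context, dialog, sep=" | ")
```

When a speaker has no bracketed context, group 2 is None.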

code

import re

regex = r"^([A-Z\s]+)(?:\[(?:[\w ]+)\])?:(.*?)$"

test_str = ("PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. \n\n"
        "LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. \n\n"
        "EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. \n\n"
        "PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. ")

matches = re.finditer(regex, test_str, re.MULTILINE)

# Number matches and groups from 1 so the labels line up with the output.
for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

output

Match 1 was found at 0-100: PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.     
Group 1 found at 0-4: PAUL
Group 2 found at 5-97:  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.

Match 2 was found at 100-381: LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. 
Group 1 found at 100-107: LEONARD
Group 2 found at 108-378:  Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.

Match 3 was found at 381-684: EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.     
Group 1 found at 381-392: EVIL NINJA 
Group 2 found at 406-681:  In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.

Match 4 was found at 684-767: PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. 
Group 1 found at 684-689: PAUL 
Group 2 found at 701-767:  Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus.

Upvotes: 0

Sweeper

Reputation: 271990

[A-Z ]+(:|\[[a-zA-Z ]+\]:)

I think what you got wrong is that you did not match lowercase letters (or spaces) inside the square brackets, so [on the roof] did not match. I've added a-z and a space to that character class, and now it matches. You also did not allow whitespace in the character's name, so I've changed the start to [A-Z ]+.
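
A quick way to check this in Python is sketched below. The shortened `sample` text and the names `SPEAKER` and `found` are assumptions made for the illustration; only the pattern itself comes from this answer.

```python
import re

# Pattern from this answer: a name made of capitals and spaces, followed by
# either a colon or a bracketed context and then a colon.
SPEAKER = re.compile(r"[A-Z ]+(:|\[[a-zA-Z ]+\]:)")

# Shortened stand-in for the transcript in the question (illustrative text).
sample = (
    "PAUL: Lorem ipsum.\n"
    "EVIL NINJA [on the roof]: In enim justo.\n"
    "PAUL [SCREAMING]: Aliquam lorem ante.\n"
)

# group(0) is the whole speaker header, including the context and the colon.
found = [m.group(0) for m in SPEAKER.finditer(sample)]
print(found)
```

Note that the pattern is not anchored to the start of a line, so an all-caps word followed by a colon inside the dialog text would match as well.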

Upvotes: 0

Aran-Fey

Reputation: 43216

The problem with your regex is that it doesn't allow whitespace in the name or inside the brackets, so it doesn't match "EVIL NINJA" or "on the roof".

But yes, regex is absolutely the right way to do this. You can try this:

([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:

Usage:

import re

# `text` is assumed to hold the transcript from the question.
regex = r'([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:'

for match in re.finditer(regex, text):
    print('person:', match.group(1))
    print('context:', match.group(2))
    print()

Output:

person: PAUL
context: None

person: LEONARD
context: None

person: EVIL NINJA
context: on the roof

person: PAUL
context: SCREAMING
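
Since the goal in the question is to split the transcript into dialogs, one possible extension is sketched here. It is not part of the original answer: the helper `split_dialogs` and the shortened `sample` text are hypothetical, and the sketch assumes that each speech runs from the end of one speaker header to the start of the next.

```python
import re

# Same pattern as above: group 1 is the person, group 2 the optional context.
SPEAKER = re.compile(r'([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:')


def split_dialogs(text):
    """Return (person, context, speech) tuples (hypothetical helper)."""
    matches = list(SPEAKER.finditer(text))
    dialogs = []
    for i, m in enumerate(matches):
        # The speech ends where the next speaker header starts,
        # or at the end of the text for the last speaker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        dialogs.append((m.group(1), m.group(2), text[m.end():end].strip()))
    return dialogs


# Shortened stand-in for the transcript in the question (illustrative text).
sample = (
    "PAUL: Lorem ipsum.\n\n"
    "EVIL NINJA [on the roof]: In enim justo. Nullam dictum.\n\n"
    "PAUL [SCREAMING]: Aliquam lorem ante.\n"
)

for person, context, speech in split_dialogs(sample):
    print(person, context, speech, sep=" | ")
```

This relies on the dialog text never containing a capitalised word directly followed by a colon; if it can, anchoring the pattern to the start of a line with ^ and re.MULTILINE would be safer.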

Upvotes: 3
