chekhov's_gin

Reputation: 51

How to search for regex pattern across multiple lines of text with re.DOTALL?

I'm a lawyer and python beginner, so I'm both (a) dumb and (b) completely out of my lane.

I'm trying to apply a regex pattern to a text file. The pattern can sometimes stretch across multiple lines. I'm specifically interested in these lines from the text file:

Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, 
Judge;  and \n
 \n
Dickinson, Emily, Judge.

I'd like to individually hunt for, extract, and then print the judges' names. My code so far looks like this:

import re
def judges():
    presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;', re.DOTALL)
    judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;', re.DOTALL)
    judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)
    with open("text.txt", "r") as case:
        for lines in case:
            presiding_match = re.search(presiding, lines)
            judge2_match = re.search(judge2, lines)
            judge3_match = re.search(judge3, lines)
            if presiding_match or judge2_match or judge3_match:
                print(presiding_match.group(1))
                print(judge2_match.group(1))
                print(judge3_match.group(1))
                break

When I run it, I can get Hemingway and Bell, but then I get an "AttributeError: 'NoneType' object has no attribute 'group'" for the third judge after the two line breaks.

After trial-and-error, I've found that my code is only reading the first line (until the "Bell, Judge; and") then quits. I thought the re.DOTALL would solve it, but I can't seem to make it work.

I've tried a million ways to capture the line breaks and get the whole thing, including re.match, re.DOTALL, re.MULTILINE, "".join, "".join(lines.strip()), and anything else I can throw against the wall to make stick.

After a couple days, I've bowed to asking for help. Thanks for anything you can do.

(As an aside, I've had no luck getting the regex to work with the ^ and $ characters. It also seems to hate the \. escape in the judge3 regex.)

Upvotes: 5

Views: 1693

Answers (3)

finefoot

Reputation: 11272

Instead of multiple re.search, you could use re.findall with a really short and simple pattern to find all judges at once:

import re

text = """Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, 
Judge;  and \n
 \n
Dickinson, Emily, Judge."""

matches = re.findall(r"(\w+,)?\s(\w+),(\s+Presiding)?\s+Judge", text)
print(matches)

Which prints:

[('', 'Hemingway', '  Presiding'), ('', 'Bell', ''), ('Dickinson,', 'Emily', '')]

All the raw information is there: first name, last name and "presiding attribute" (if Presiding Judge or not) of each judge. Afterwards, you can feed this raw information into a data structure which satisfies your needs, for example:

judges = []
for match in matches:
    if match[0]:
        # "Last, First" form: group 1 holds the last name plus its trailing comma
        first_name = match[1]
        last_name = match[0].rstrip(",")
    else:
        # Only one name was given, so treat it as the last name
        first_name = ""
        last_name = match[1]
    presiding = "Presiding" in match[2]
    judges.append((first_name, last_name, presiding))
print(judges)

Which prints:

[('', 'Hemingway', True), ('', 'Bell', False), ('Emily', 'Dickinson', False)]

As you can see, now you have a list of tuples, where the first element is the first name (if specified in the text), the second element is the last name and the third element is a bool whether the judge is the presiding judge or not.

Obviously, the pattern works for your provided example. However, since (\w+,)?\s(\w+),(\s+Presiding)?\s+Judge is such a simple pattern, there are some edge cases to be aware of, where the pattern might return the wrong result:

  • Only one first name will be matched. A name like Dickinson, Emily Mary will result in Mary detected as the last name.
  • Last names like de Broglie will result in only Broglie matched, so de gets lost.
  • ...

You will have to see if this fits your needs or provide more information to your question about your data.
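To make those two edge cases concrete, here is a small sketch that runs the same pattern on two made-up fragments. The input strings are assumptions invented for illustration; they do not come from your file.

```python
import re

# Same pattern as in the answer above
pattern = re.compile(r"(\w+,)?\s(\w+),(\s+Presiding)?\s+Judge")

# Hypothetical input with two first names after the last name
two_first_names = pattern.findall("and Dickinson, Emily Mary, Judge.")

# Hypothetical input with a two-word last name
particle_name = pattern.findall("by de Broglie, Judge;")

print(two_first_names)  # [('', 'Mary', '')]
print(particle_name)    # [('', 'Broglie', '')]
```

In both cases only the word directly before ", Judge" survives, so names of this shape would need a more permissive pattern.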

Upvotes: 1

dopstar

Reputation: 1488

Assuming you can read the file all at once (i.e. the file is not too big), you can extract judge information as follows:

import re

regex = re.compile(
    r'decided\s+by\s+(?P<presiding_judge>[A-Za-z]+)\s*,\s+Presiding\s+Judge;'
    r'\s+(?P<judge>[A-Za-z]+)\s*,\s+Judge;'
    r'\s+and\s+(?P<extra_judges>[A-Za-z,\s]+)\s*,\s+Judge\.?',
    re.DOTALL | re.MULTILINE
)

filename = 'text.txt'
with open(filename) as fd:
    data = fd.read()

for match in regex.finditer(data):
    print(match.groupdict())

With a sample input text file (text.txt) containing three such passages, the output becomes:

{'judge': 'Bell', 'extra_judges': 'Dickinson, Emily', 'presiding_judge': 'Hemingway'}
{'judge': 'Abel', 'extra_judges': 'Lagrange, Gauss', 'presiding_judge': 'Einstein'}
{'judge': 'Dirichlet', 'extra_judges': 'Fourier, Cauchy', 'presiding_judge': 'Newton'}

You can also play with this on the regex101 site.
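If you would rather pull out individual names than print the whole dictionary, the named groups can be read one at a time. The following is only a sketch: it applies the same pattern to the passage from the question held in a string (so no file is needed), and it leaves out the flags, since the pattern contains no ., ^ or $ for them to act on.

```python
import re

regex = re.compile(
    r'decided\s+by\s+(?P<presiding_judge>[A-Za-z]+)\s*,\s+Presiding\s+Judge;'
    r'\s+(?P<judge>[A-Za-z]+)\s*,\s+Judge;'
    r'\s+and\s+(?P<extra_judges>[A-Za-z,\s]+)\s*,\s+Judge\.?'
)

# The passage from the question, with its line breaks written out explicitly
text = (
    "Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, \n"
    "Judge;  and \n"
    " \n"
    "Dickinson, Emily, Judge."
)

match = regex.search(text)
presiding = match.group('presiding_judge')                  # one named group
others = [match.group('judge'), match.group('extra_judges')]

print(presiding)  # Hemingway
print(others)     # ['Bell', 'Dickinson, Emily']
```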

Upvotes: 1

Martijn Pieters

Reputation: 1123360

You are passing in single lines, because you are iterating over the open file referenced by case. The regex is never passed anything other than a single line of text. Each of your regexes can match one of the lines, but no single line is matched by all three.
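As a minimal sketch of that behaviour, the snippet below feeds a file-like object to your judge3 pattern one line at a time. io.StringIO stands in for the open file and the text is the sample from your question; both are assumptions made so the example runs without text.txt.

```python
import io
import re

judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)

# io.StringIO is a stand-in for open("text.txt"); iterating yields one line at a time
case = io.StringIO(
    "Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, Judge;  and \n"
    " \n"
    "Dickinson, Emily, Judge.\n"
)

# For every line, record whether judge3 found anything in that line alone
results = [(line, judge3.search(line) is not None) for line in case]

for line, matched in results:
    print(repr(line), matched)
```

Only the last line matches judge3. On the first line judge3_match is therefore None, and calling .group(1) on it raises the AttributeError you saw.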

You'd have to read in more than one line. If the file is small enough, just read it as one string:

with open("text.txt", "r") as case:
    case_text = case.read()

then apply your regular expressions to that one string.

Or, you could test each of the match objects individually, not as a group, and only print those that matched:

if presiding_match:
    print(presiding_match.group(1))
elif judge2_match:
    print(judge2_match.group(1))
elif judge3_match:
    print(judge3_match.group(1))

but then you'll have to create additional logic to determine when you are done reading from the file and break out of the loop.
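One possible shape for that extra logic is sketched here. It assumes the sentence starts with "decided by" on a single line and ends at the first line containing "Judge."; lines between those two markers are buffered and searched together. The function name judges_from_lines, both markers, and the name pattern are hypothetical choices, not something taken from your code.

```python
import re

START = re.compile(r'decided\s+by\s+')   # assumed opening marker
NAME = re.compile(r'([A-Z][^;]+),\s+(?:Presiding\s+)?Judge[.;]')

def judges_from_lines(lines):
    """Collect judge names from an iterable of lines, stopping early."""
    buffer = []
    for line in lines:
        if not buffer:
            start = START.search(line)
            if start is None:
                continue                  # the sentence has not started yet
            line = line[start.end():]     # drop everything up to "decided by"
        buffer.append(line)
        if 'Judge.' in line:
            break                         # assumed end marker: stop reading
    # Search the buffered lines as one string so names may span line breaks
    return NAME.findall(''.join(buffer))

sample = [
    "Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, \n",
    "Judge;  and \n",
    " \n",
    "Dickinson, Emily, Judge.\n",
]
print(judges_from_lines(sample))  # ['Hemingway', 'Bell', 'Dickinson, Emily']
```

Because the loop breaks at the end marker, passing the open file object itself means the rest of the file is never read.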

Note that the names you are capturing with .* are not broken across lines, and \s already matches newlines, so the DOTALL flag is not actually needed here. Because you match .* text, DOTALL puts you at risk of matching too much:

>>> import re
>>> case_text = """Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, Judge;  and
...
... Dickinson, Emily, Judge.
... """
>>> presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;', re.DOTALL)
>>> judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;', re.DOTALL)
>>> judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)
>>> presiding.search(case_text).groups()
('Hemingway',)
>>> judge2.search(case_text).groups()
('Bell',)
>>> judge3.search(case_text).groups()
('Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, Judge;  and \n\nDickinson, Emily',)

I'd at least replace [A-Z].* with [A-Z][^;]+, which stops at ; semicolons and only matches names at least 2 characters long. That pattern contains no ., so you can drop the DOTALL flags altogether:

>>> presiding = re.compile(r'by\s*?([A-Z][^;]+),\s+?Presiding\s+?Judge;')
>>> judge2 = re.compile(r'Presiding\s+?Judge;\s+?([A-Z][^;]+),\s+?Judge;')
>>> judge3 = re.compile(r'([A-Z][^;]+), Judge\.')
>>> presiding.search(case_text).groups()
('Hemingway',)
>>> judge2.search(case_text).groups()
('Bell',)
>>> judge3.search(case_text).groups()
('Dickinson, Emily',)

You can combine the three patterns into one:

judges = re.compile(
    r'(?:Considered\s+?and\s+?decided\s+?by\s+?)?'
    r'([A-Z][^;]+),\s+?(?:Presiding\s+?)?Judge[.;]'
)

which can find all the judges in your input in one go with .findall():

>>> judges.findall(case_text)
['Hemingway', 'Bell', 'Dickinson, Emily']
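Putting the pieces together for a whole file, here is a hedged sketch. One caveat: [A-Z][^;]+ can start at any capital letter, so if the file has other text before the sentence (a case caption, say), the first match may swallow it. The sketch therefore skips to just after "decided by" before searching. The helper name find_judges and the START marker are assumptions for illustration.

```python
import re
from pathlib import Path

# Name part of the combined pattern; the sentence prefix is handled by START instead
NAME = re.compile(r'([A-Z][^;]+),\s+?(?:Presiding\s+?)?Judge[.;]')
START = re.compile(r'decided\s+by\s+')   # assumed marker for where the names begin

def find_judges(path):
    """Return the judge names found in the file at path (hypothetical helper)."""
    text = Path(path).read_text()
    start = START.search(text)
    if start is None:
        return []                         # no "decided by" sentence in this file
    # A compiled pattern's findall() accepts a start position as second argument
    return NAME.findall(text, start.end())
```

Calling find_judges("text.txt") on a file containing the passage from the question should give the same three names as the findall() call above.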

Upvotes: 2
