Reputation: 51
I'm a lawyer and a Python beginner, so I'm both (a) dumb and (b) completely out of my lane.
I'm trying to apply a regex pattern to a text file. The pattern can sometimes stretch across multiple lines. I'm specifically interested in these lines from the text file:
Considered and decided by Hemingway, Presiding Judge; Bell,
Judge; and \n
\n
Dickinson, Emily, Judge.
I'd like to individually hunt for, extract, and then print the judges' names. My code so far looks like this:
import re

def judges():
    presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;', re.DOTALL)
    judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;', re.DOTALL)
    judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)
    with open("text.txt", "r") as case:
        for lines in case:
            presiding_match = re.search(presiding, lines)
            judge2_match = re.search(judge2, lines)
            judge3_match = re.search(judge3, lines)
            if presiding_match or judge2_match or judge3_match:
                print(presiding_match.group(1))
                print(judge2_match.group(1))
                print(judge3_match.group(1))
                break
When I run it, I can get Hemingway and Bell, but then I get an "AttributeError: 'NoneType' object has no attribute 'group'" for the third judge after the two line breaks.
After trial-and-error, I've found that my code is only reading the first line (until the "Bell, Judge; and") then quits. I thought the re.DOTALL would solve it, but I can't seem to make it work.
I've tried a million ways to capture the line breaks and get the whole thing, including re.match, re.DOTALL, re.MULTILINE, "".join, "".join(lines.strip()), and anything else I can throw against the wall to make stick.
After a couple days, I've bowed to asking for help. Thanks for anything you can do.
(As an aside, I've had no luck getting the regex to work with the ^ and $ characters. It also seems to hate the \. escape in the judge3 regex.)
Upvotes: 5
Views: 1693
Reputation: 11272
Instead of multiple re.search calls, you could use re.findall with a really short and simple pattern to find all judges at once:
import re
text = """Considered and decided by Hemingway, Presiding Judge; Bell,
Judge; and \n
\n
Dickinson, Emily, Judge."""
matches = re.findall(r"(\w+,)?\s(\w+),(\s+Presiding)?\s+Judge", text)
print(matches)
Which prints:
[('', 'Hemingway', ' Presiding'), ('', 'Bell', ''), ('Dickinson,', 'Emily', '')]
All the raw information is there: first name, last name and "presiding attribute" (if Presiding Judge or not) of each judge. Afterwards, you can feed this raw information into a data structure which satisfies your needs, for example:
judges = []
for match in matches:
    if match[0]:
        # "Last, First" form: group 1 holds the last name plus its trailing comma
        first_name = match[1]
        last_name = match[0].rstrip(",")
    else:
        # only one name given: treat it as the last name
        first_name = ""
        last_name = match[1]
    presiding = "Presiding" in match[2]
    judges.append((first_name, last_name, presiding))
print(judges)
Which prints:
[('', 'Hemingway', True), ('', 'Bell', False), ('Emily', 'Dickinson', False)]
As you can see, now you have a list of tuples, where the first element is the first name (if specified in the text), the second element is the last name and the third element is a bool indicating whether the judge is the presiding judge or not.
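If positional tuples feel fragile, the same information can be given field names. This is only a sketch: the Judge class and its field names are my own illustration, not something the question asks for, and it strips the trailing comma that the first capture group carries along.

```python
import re
from typing import NamedTuple


class Judge(NamedTuple):  # hypothetical container, named for illustration only
    first_name: str
    last_name: str
    presiding: bool


text = """Considered and decided by Hemingway, Presiding Judge; Bell,
Judge; and \n
\n
Dickinson, Emily, Judge."""

matches = re.findall(r"(\w+,)?\s(\w+),(\s+Presiding)?\s+Judge", text)

judges = []
for surname, name, presiding in matches:
    if surname:
        # "Last, First" form: group 1 is the last name plus its comma
        judges.append(Judge(name, surname.rstrip(","), bool(presiding)))
    else:
        # single-name form: treat the only name as the last name
        judges.append(Judge("", name, bool(presiding)))

for judge in judges:
    print(judge.last_name, judge.presiding)
```

Each entry can then be read as judge.last_name or judge.presiding instead of by index.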
Obviously, the pattern works for your provided example. However, since (\w+,)?\s(\w+),(\s+Presiding)?\s+Judge is such a simple pattern, there are some edge cases to be aware of, where the pattern might return the wrong result:
Dickinson, Emily Mary will result in Mary detected as the last name.
de Broglie will result in only Broglie matched, so de gets lost.
You will have to see if this fits your needs or provide more information to your question about your data.
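To see those two edge cases for yourself, here is a small sketch. The two input strings are made up for the demonstration and are not from your file.

```python
import re

pattern = re.compile(r"(\w+,)?\s(\w+),(\s+Presiding)?\s+Judge")

# Hypothetical inputs, invented only to exercise the two edge cases
two_first_names = "Judge; and Dickinson, Emily Mary, Judge."
particle_surname = "decided by de Broglie, Presiding Judge;"

# Only "Mary" sits directly before ", Judge", so "Dickinson, Emily" is dropped
print(pattern.findall(two_first_names))   # [('', 'Mary', '')]

# \w+ cannot cross the space in "de Broglie", so "de" is lost
print(pattern.findall(particle_surname))  # [('', 'Broglie', ' Presiding')]
```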
Upvotes: 1
Reputation: 1488
Assuming you can read the file all at once (i.e. the file is not too big), you can extract judge information as follows:
import re

regex = re.compile(
    r'decided\s+by\s+(?P<presiding_judge>[A-Za-z]+)\s*,\s+Presiding\s+Judge;'
    r'\s+(?P<judge>[A-Za-z]+)\s*,\s+Judge;'
    r'\s+and\s+(?P<extra_judges>[A-Za-z,\s]+)\s*,\s+Judge\.?',
    re.DOTALL | re.MULTILINE
)

filename = 'text.txt'
with open(filename) as fd:
    data = fd.read()

for match in regex.finditer(data):
    print(match.groupdict())
With a sample input text file (text.txt), the output becomes:
{'judge': 'Bell', 'extra_judges': 'Dickinson, Emily', 'presiding_judge': 'Hemingway'}
{'judge': 'Abel', 'extra_judges': 'Lagrange, Gauss', 'presiding_judge': 'Einstein'}
{'judge': 'Dirichlet', 'extra_judges': 'Fourier, Cauchy', 'presiding_judge': 'Newton'}
You can also play with this pattern on the regex101 site.
Upvotes: 1
Reputation: 1123360
You are passing in single lines, because you are iterating over the open file referenced by case. The regex is never passed anything other than a single line of text. Your regexes can each match some of the lines, but they don't all together match the same single line.
You'd have to read in more than one line. If the file is small enough, just read it as one string:
with open("text.txt", "r") as case:
    case_text = case.read()
then apply your regular expressions to that one string.
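A quick sketch of the difference, assuming the layout from the question where "Bell," and "Judge;" land on different lines. The temporary file only stands in for your text.txt, and the pattern is a slightly tightened variant of your judge2 (my assumption: names never contain a semicolon).

```python
import re
import tempfile
from pathlib import Path

# Tightened variant of the question's judge2 pattern (assumption, not the original)
judge2 = re.compile(r'Presiding\s+Judge;\s+([A-Z][^;]+),\s+Judge;')

sample = (
    "Considered and decided by Hemingway, Presiding Judge; Bell,\n"
    "Judge; and \n"
    "\n"
    "Dickinson, Emily, Judge.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "text.txt"  # stand-in for the real file
    path.write_text(sample)

    # Line by line: no single line contains both "Bell," and "Judge;"
    per_line = []
    with open(path) as case:
        for line in case:
            match = judge2.search(line)
            if match:
                per_line.append(match.group(1))

    # Whole file as one string: \s+ is free to cross the line break
    with open(path) as case:
        whole = judge2.search(case.read())

print(per_line)        # []
print(whole.group(1))  # Bell
```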
Or, you could test each of the match objects individually, not as a group, and only print those that matched:
# separate if statements, because one line can match more than one pattern
if presiding_match:
    print(presiding_match.group(1))
if judge2_match:
    print(judge2_match.group(1))
if judge3_match:
    print(judge3_match.group(1))
but then you'll have to create additional logic to determine when you are done reading from the file and break out of the loop.
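One way that extra logic could look is sketched below. It is not the only option: the found dict and the role names are made up for illustration, io.StringIO stands in for the open file, and the patterns are tightened variants of yours (assuming names never contain a semicolon). It also only works when each judge's text sits on a single line, which is why "Bell, Judge;" is kept on one line here.

```python
import io
import re

# Tightened variants of the question's patterns (assumption: no ';' inside a name)
patterns = {
    "presiding": re.compile(r'by\s+([A-Z][^;]+),\s+Presiding\s+Judge;'),
    "judge2": re.compile(r'Presiding\s+Judge;\s+([A-Z][^;]+),\s+Judge;'),
    "judge3": re.compile(r'([A-Z][^;]+), Judge\.'),
}

# Stand-in for the open file; "Bell, Judge;" sits on one line in this variant
case = io.StringIO(
    "Considered and decided by Hemingway, Presiding Judge; Bell, Judge; and \n"
    "\n"
    "Dickinson, Emily, Judge.\n"
)

found = {}
for line in case:
    for role, pattern in patterns.items():
        if role not in found:
            match = pattern.search(line)
            if match:
                found[role] = match.group(1)
    if len(found) == len(patterns):
        break  # every judge located, stop reading

print(found)
```

If a name and its "Judge;" can be split across lines, as in the text you posted, reading the whole file into one string is the simpler route.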
Note that the patterns you are matching are not broken across lines, so the DOTALL flag is not actually needed here. You do match .* text, so you are running the risk of matching too much if you use DOTALL:
>>> import re
>>> case_text = """Considered and decided by Hemingway, Presiding Judge; Bell, Judge; and
...
... Dickinson, Emily, Judge.
... """
>>> presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;', re.DOTALL)
>>> judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;', re.DOTALL)
>>> judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)
>>> presiding.search(case_text).groups()
('Hemingway',)
>>> judge2.search(case_text).groups()
('Bell',)
>>> judge3.search(case_text).groups()
('Considered and decided by Hemingway, Presiding Judge; Bell, Judge; and \n\nDickinson, Emily',)
I'd at least replace [A-Z].* with [A-Z][^;\n]+, to at least exclude matching ; semicolons and newlines, and only match names at least 2 characters long. Just drop the DOTALL flags altogether:
>>> presiding = re.compile(r'by\s*?([A-Z][^;]+),\s+?Presiding\s+?Judge;')
>>> judge2 = re.compile(r'Presiding\s+?Judge;\s+?([A-Z][^;]+),\s+?Judge;')
>>> judge3 = re.compile(r'([A-Z][^;]+), Judge\.')
>>> presiding.search(case_text).groups()
('Hemingway',)
>>> judge2.search(case_text).groups()
('Bell',)
>>> judge3.search(case_text).groups()
('Dickinson, Emily',)
You can combine the three patterns into one:
judges = re.compile(
    r'(?:Considered\s+?and\s+?decided\s+?by\s+?)?'
    r'([A-Z][^;]+),\s+?(?:Presiding\s+?)?Judge[.;]'
)
which can find all the judges in your input in one go with .findall():
>>> judges.findall(case_text)
['Hemingway', 'Bell', 'Dickinson, Emily']
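If you also need to know which judge is presiding, one possible variation is to turn the optional Presiding part into a named group and use .finditer(). This is a sketch only: the group names name and presiding are my own choice, and the sample text uses the line layout from the question.

```python
import re

# Same idea as the combined pattern, with named groups added (names are illustrative)
judges = re.compile(
    r'(?:Considered\s+and\s+decided\s+by\s+)?'
    r'(?P<name>[A-Z][^;]+),\s+(?P<presiding>Presiding\s+)?Judge[.;]'
)

case_text = """Considered and decided by Hemingway, Presiding Judge; Bell,
Judge; and

Dickinson, Emily, Judge.
"""

# The presiding group is None whenever the word "Presiding" was absent
results = [
    (m.group("name"), m.group("presiding") is not None)
    for m in judges.finditer(case_text)
]
print(results)
# [('Hemingway', True), ('Bell', False), ('Dickinson, Emily', False)]
```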
Upvotes: 2