user1539179

Reputation: 1885

Using regex to capture text in parentheses if it exists

I am working on a Python script to parse the My Clippings file that Kindles generate when someone highlights a passage, takes a note, or adds a bookmark. I use regex to collect the data from the file and plan to store it in a SQLite database. However, I am having trouble matching the line that contains the book's title and, possibly, an author.

This line can take one of three formats:

Title (Last, First)
Title (Author)
Title

I want the regex to capture the title and, if it exists, whatever is inside the trailing parentheses; otherwise it should capture an empty string for the author. For the three lines above, I want these results:

('Title', 'Last, First')
('Title', 'Author')
('Title', '')

So far I have a regex that captures the text in the parentheses, but it does not match titles without authors. Here is what I have now:

(.+) (?:\((.+)\)(?:\n|\Z))*

The only issue is that it requires the line to end with an author. If I make the author optional so that it can match an empty string, the entire line is captured as the title with no author, i.e.:

('Title (Last, First)', '')
('Title (Author)', '')
('Title', '')

Upvotes: 2

Views: 955

Answers (3)

Tom Lord

Reputation: 28305

Here's my version, which is very similar to Jerry's, but perhaps a little safer:

(\w+?)(?:\s?\(([\w,\s]*)\))?$

This covers a few more cases, such as leading indentation, a missing space before the brackets, and empty brackets.

Here's a demo: http://www.rubular.com/r/8C1pireOwV
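
As a rough check of the cases listed above, the pattern can be tried with Python's `re.search`. The linked demo is on Rubular, so this sketch assumes the pattern behaves the same way under Python's `re` for single lines. The helper name `parse_title_line` and the sample lines are made up for illustration. One thing the sketch makes visible: `\w` does not match spaces, so a multi-word title is captured only from its last word.

```python
import re

# Pattern from the answer above. Nothing anchors the start of the line,
# so re.search is used and leading whitespace is simply skipped.
PATTERN = re.compile(r'(\w+?)(?:\s?\(([\w,\s]*)\))?$')

def parse_title_line(line):
    """Hypothetical helper: return (title, author), with '' when no author is present."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # group(2) is None when the optional bracket group did not match.
    return (m.group(1), m.group(2) or '')

samples = ['Title (Last, First)', 'Title(Author)', '    Title ()', 'Title',
           'The Title (Bob, Jones)']
for sample in samples:
    print(repr(sample), '->', parse_title_line(sample))
```

The last sample yields `('Title', 'Bob, Jones')` rather than `'The Title'`; widening the first group to something like `[\w ]+?` would be one way to keep multi-word titles, at the cost of also capturing leading indentation.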

Upvotes: 1

HennyH

Reputation: 7944

With a file like:

Title (Last, First)
Title (Author)
Title 
Title ()
    Title ()
The Title (Bob, Jones)

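The following script:

import re

# A title, then an optional "(author)" group; surrounding whitespace is ignored.
pattern = re.compile(r'^\s*(\w[\w\s]*?)\s*(?:\((.*?)\))?\s*$')

matches = []
with open('file.txt') as f:
    for line in f:
        m = pattern.match(line)
        if m:  # skip lines that do not look like a title line
            # group(2) is None when the line has no parentheses
            matches.append((m.group(1), m.group(2) or ''))

for m in matches:
    print(m)

prints:

('Title', 'Last, First')
('Title', 'Author')
('Title', '')
('Title', '')
('Title', '')
('The Title', 'Bob, Jones')

which is your desired result.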

Upvotes: 1

Jerry

Reputation: 71538

If you try to match line by line, you can use this regex:

^(.+?)(?: \((.+)\))?$

I added start-of-line and end-of-line anchors, made the title group lazy, and moved the space into the first non-capturing group so that a title without any other details can still be captured. I changed the * quantifier to ?, since I don't think you'll have more than one pair of brackets. Change it back if you think you do have more.

I removed the second non-capturing group because the end-of-line anchor already ensures the match reaches the end of the line.
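
For reference, here is a minimal sketch of how this regex might be applied from Python, one line at a time. The helper name `split_title_author` and the sample lines are illustrative assumptions, not part of the original script.

```python
import re

# Lazy title, then an optional " (author)" group just before the end of the line.
LINE_RE = re.compile(r'^(.+?)(?: \((.+)\))?$')

def split_title_author(line):
    """Hypothetical helper: return (title, author), using '' when there is no author."""
    m = LINE_RE.match(line)
    if m is None:  # e.g. an empty line
        return None
    # group(2) is None when the optional group did not take part in the match.
    return (m.group(1), m.group(2) or '')

for line in ['Title (Last, First)', 'Title (Author)', 'Title']:
    print(split_title_author(line))
```

Because an unmatched optional group yields `None` in Python rather than an empty string, the `or ''` is what produces the blank author the question asks for.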


Upvotes: 1
