Reputation: 1885
I am working on a Python script to parse the My Clippings file that Kindles generate when someone highlights a passage, takes a note, or adds a bookmark. I am using regex to collect the data from the file, and I plan to store it in a SQLite database. However, I am having trouble matching the line that contains the title of the book and possibly an author.
There are three possibilities for this line. They can be in the format:
Title (Last, First)
Title (Author)
Title
What I want is for the regex to capture the title, plus whatever is inside the trailing parentheses if they exist, and otherwise capture a blank string. For example, I want the regex to give me these results:
('Title', 'Last, First')
('Title', 'Author')
('Title', '')
So far I have a regex that captures the part in parentheses, but it does not match titles without authors. Here is what I have now:
(.+) (?:\((.+)\)(?:\n|\Z))*
The only issue is that it requires the line to end with an author. If I make the author optional so that a blank string is accepted, the entire line is captured as the title with no author, i.e.:
('Title (Last, First)', '')
('Title (Author)', '')
('Title', '')
Upvotes: 2
Views: 955
Reputation: 28305
Here's my version, which is very similar to Jerry's, but perhaps a little safer:
(\w+?)(?:\s?\(([\w,\s]*)\))?$
This covers a few more cases such as indentation, missing a space before the brackets, and empty brackets.
Here's a demo: http://www.rubular.com/r/8C1pireOwV
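For anyone who wants to try this pattern from Python rather than Rubular, here is a minimal sketch. The helper name `split_title` is made up for illustration, and it assumes the pattern is applied with `re.search` to one line at a time. An optional group that does not take part in the match comes back as `None`, so the sketch converts that to the blank string the question asks for.

```python
import re

# Pattern copied from the answer above; `split_title` is a hypothetical helper name.
PATTERN = re.compile(r'(\w+?)(?:\s?\(([\w,\s]*)\))?$')

def split_title(line):
    """Return (title, author) for one line, or None if nothing matches."""
    m = PATTERN.search(line.rstrip('\n'))
    if m is None:
        return None
    title, author = m.groups()
    # The optional group is None when there are no brackets at all.
    return title, author or ''

for sample in ['Title (Last, First)', '  Title(Author)', 'Title ()', 'Title']:
    print(split_title(sample))
```

One thing to keep in mind: `\w` does not match spaces, so for a multi-word title such as `The Title (Bob, Jones)` only the last word ends up in the first group. If your titles contain spaces, the first group would need to allow them.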
Upvotes: 1
Reputation: 7944
With a file like:
Title (Last, First)
Title (Author)
Title
Title ()
Title ()
The Title (Bob, Jones)
The following:
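import re

# Title, a literal space, then an optional parenthesised author.
pattern = re.compile(r'^\s*([\w\s]+) \(?(.*?)\)?$')
matches = []
with open('file.txt') as f:
    for line in f:
        m = pattern.match(line)
        if m:  # skip lines the pattern does not match instead of raising AttributeError
            matches.append(m.groups())
for m in matches:
    print(m)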
('Title', 'Last, First')
('Title', 'Author')
('Title', '')
('Title', '')
('Title', '')
('The Title', 'Bob, Jones')
>>>
Will produce your desired result.
Upvotes: 1
Reputation: 71538
If you try to match line by line, you can use this regex:
^(.+?)(?: \((.+)\))?$
I added the start-of-line and end-of-line anchors, then put the space inside the first non-capturing group, so that a title without any other details can be captured. I changed the * operator to ?, since I don't think you'll have more than one pair of brackets. Change it back if you think you do have more.
I removed the second non-capturing group, as the end-of-line anchor will ensure it's the end of the line.
Demo here.
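As a rough sketch of how this could be used from Python, assuming each line is matched on its own: the helper name `parse_title_line` is hypothetical. Note that when the optional author group does not participate, Python reports it as `None` rather than `''`, so the sketch normalises that to get the tuples asked for in the question.

```python
import re

# Pattern copied from the answer above; `parse_title_line` is a hypothetical helper name.
TITLE_RE = re.compile(r'^(.+?)(?: \((.+)\))?$')

def parse_title_line(line):
    """Return (title, author) for one line, with '' when no author is present."""
    m = TITLE_RE.match(line.rstrip('\r\n'))
    if m is None:  # e.g. an empty line
        return None
    title, author = m.groups()
    # The optional group is None when the line has no "(...)" suffix.
    return title, author or ''

for sample in ['Title (Last, First)', 'Title (Author)', 'Title']:
    print(parse_title_line(sample))
```

Because the title group is `.+?` rather than `\w+`, titles containing spaces or punctuation are kept whole.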
Upvotes: 1