Reputation: 125
my_str :
PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'
my code :
import re
regex = re.compile(r'(Applicants:)( )?(.*)', re.MULTILINE)
print(regex.findall(my_str))
my output :
[('Applicants:', ' ', 'Silixa Ltd.')]
What I need is the text between 'Applicants:' and '\nInventors:', i.e.:
'Silixa Ltd.' & 'Chevron U.S.A. Inc. (Incorporated
in USA - California)'
Thanks in advance for your help
Upvotes: 1
Views: 1000
Reputation: 163207
If you want to match all the text between \nApplicants: and \nInventors:, you could also get the match without using re.DOTALL, preventing unnecessary backtracking.
Match Applicants: and capture in group 1 the rest of that same line and all lines that follow that do not start with Inventors:
Then match Inventors:
^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:
^  Start of the line, since re.MULTILINE is used (or use \b if the match does not have to be at the start of a line)
Applicants:  Match literally, including the space that follows
(  Capture group 1
.*  Match the rest of the line
(?:\r?\n(?!Inventors:).*)*  Match all following lines that do not start with Inventors:
)  Close group
\r?\nInventors:  Match a newline and Inventors:
Example code
import re
text = ("PCT Filing Date: 2 December 2015\n"
"Applicants: Silixa Ltd.\n"
"Chevron U.S.A. Inc. (Incorporated\n"
"in USA - California)\n"
"Inventors: Farhadiroushan,\n"
"Mahmoud\n"
"Gillies, Arran\n"
"Parker, Tom'")
regex = re.compile(r'^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:', re.MULTILINE)
print(regex.findall(text))
Output
['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']
Upvotes: 1
Reputation: 103744
Here is a more general approach that parses a string like that into a dict of all the keys and values in it (i.e., any string at the start of a line followed by a : is a key, and the string following that key is its data):
import re
txt="""\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""
pat=re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data={m.group(1):m.group(2) for m in pat.finditer(txt)}
Result:
>>> data
{'PCT Filing Date': '2 December 2015\n', 'Applicants': 'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n', 'Inventors': "Farhadiroushan,\nMahmoud\nGillies, Arran\nParker, Tom'"}
>>> data['Applicants']
'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n'
Upvotes: 0
Reputation: 3737
Try using re.DOTALL instead:
import re
text='''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''
regex = re.compile(r'Applicants:(.*?)Inventors:', re.DOTALL)
print(regex.findall(text))
gives me
$ python test.py
[' Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n']
The reason this works is that re.MULTILINE only changes where ^ and $ match; it doesn't let the dot (.) match newlines, whereas re.DOTALL does.
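To make the difference concrete, here is a minimal sketch on a shortened, made-up sample string (the names `sample`, `multiline` and `dotall` are only for illustration). It shows the dot stopping at the first newline under re.MULTILINE and crossing newlines under re.DOTALL:

```python
import re

# Hypothetical, shortened sample; not the exact string from the question
sample = "Applicants: A Ltd.\nB Inc.\nInventors: X"

# With re.MULTILINE the dot still stops at the first newline
multiline = re.findall(r'Applicants:(.*)', sample, re.MULTILINE)

# With re.DOTALL the dot also matches newlines, so the lazy group
# runs until the first "Inventors:"
dotall = re.findall(r'Applicants:(.*?)Inventors:', sample, re.DOTALL)

print(multiline)  # [' A Ltd.']
print(dotall)     # [' A Ltd.\nB Inc.\n']
```

Note that the captured text keeps its leading space and trailing newline; calling .strip() on the result removes them if they are not wanted.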
Upvotes: 2
Reputation: 390
If what you want is the contents between Applicants: and \nInventors:, your regex should reflect that:
>>> regex = re.compile(r'Applicants: (.*)Inventors:', re.S)
>>> print(regex.findall(s))
['Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n']
re.S is the "dot matches all" option (an alias of re.DOTALL), so our (.*) will also match newlines. Note that this is different from re.MULTILINE: re.MULTILINE only makes ^ and $ match at the start and end of every line, and doesn't change the fact that . will not match newlines. If . doesn't match newlines, a match like (.*) will still stop at the first newline, not achieving the multiline effect you want.
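As a rough illustration of what re.MULTILINE does change, the sketch below uses a made-up two-line string (all variable names are hypothetical). Only the anchors ^ and $ behave differently; the dot still stays within a single line:

```python
import re

# Hypothetical two-line sample
sample = "first line\nsecond line"

# Without flags, ^ only matches at the very start of the string
no_flag = re.findall(r'^\w+', sample)

# With re.MULTILINE, ^ also matches right after each newline
with_m = re.findall(r'^\w+', sample, re.MULTILINE)

# Even with re.MULTILINE, .* never crosses a newline:
# each match is confined to one line
per_line = re.findall(r'^.*$', sample, re.MULTILINE)

print(no_flag)   # ['first']
print(with_m)    # ['first', 'second']
print(per_line)  # ['first line', 'second line']
```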
Also note that if you are not interested in capturing Applicants: or Inventors:, you should not wrap them in (), as in (Inventors:), because every pair of parentheses creates a capturing group and findall returns one element per group. That's the reason you got a tuple of 3 elements in your output instead of just 1.
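A small sketch of how the number of groups shapes what findall returns, using a one-line sample assumed for illustration (the variable names are hypothetical). A non-capturing group (?:...) is one way to keep parentheses for grouping without adding an element to the result:

```python
import re

# Hypothetical one-line sample
sample = "Applicants: Silixa Ltd."

# Three capturing groups: findall returns a list of 3-tuples
three_groups = re.findall(r'(Applicants:)( )?(.*)', sample)

# One capturing group: findall returns a list of plain strings
one_group = re.findall(r'Applicants: ?(.*)', sample)

# (?:...) groups without capturing, so the result is still plain strings
non_capturing = re.findall(r'(?:Applicants:) ?(.*)', sample)

print(three_groups)   # [('Applicants:', ' ', 'Silixa Ltd.')]
print(one_group)      # ['Silixa Ltd.']
print(non_capturing)  # ['Silixa Ltd.']
```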
Upvotes: 1