Reputation: 4210
Demo at regex101. I have the following text file (a bibtex .bbl file):
\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, G.~De~Franceschi, V.~Romano, M.~Aquino, A.~Dodson, and
C.~N. Mitchell (2011{\natexlab{a}}), Bipolar climatology of {GPS} ionospheric
scintillation at solar minimum, \textit{Radio Science}, \textit{46}(3),
\doi{10.1029/2010RS004571}.
\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, J.~Tong, G.~De~Franceschi, V.~Romano, A.~Bourdillon,
M.~Le~Huy, and C.~Mitchell (2011{\natexlab{b}}), {GPS} scintillation and
{TEC} gradients at equatorial latitudes in april 2006, \textit{Advances in
Space Research}, \textit{47}(10), 1750--1757,
\doi{10.1016/j.asr.2010.04.020}.
\bibitem[{\textit{Anghel et~al.}(2008)\textit{Anghel, Astilean, Letia, and
Komjathy}}]{anghel2008nrm}
Anghel, A., A.~Astilean, T.~Letia, and A.~Komjathy (2008), Near real-time
monitoring of the ionosphere using dual frequency {GPS} data in a kalman
filter approach, in \textit{{IEEE} International Conference on Automation,
Quality and Testing, Robotics, 2008. {AQTR} 2008}, vol.~2, pp. 54--58,
\doi{10.1109/AQTR.2008.4588793}.
\bibitem[{\textit{Baker and Wing}(1989)}]{baker1989nmc}
Baker, K.~B., and S.~Wing (1989), A new magnetic coordinate system for
conjugate studies at high latitudes, \textit{Journal of Geophysical Research:
Space Physics}, \textit{94}(A7), 9139--9143, \doi{10.1029/JA094iA07p09139}.
I want to match the whole \bibitem command for a single entry (with some capture groups), given the reference code at the end of the command. I use this regex, which works for the first entry but not for the rest (the second entry is used as the example below):
\\bibitem\[{(.*?)\((.*?)\)(.*?)}\]{alfonsi2011gsa}
This doesn't work, since it matches everything from the start of the first \bibitem command to the end of the second \bibitem command. How can I match only the second \bibitem command? I have tried using a negative lookahead for ^$ and \n\n, but I couldn't get either to work. Basically, I want the third (.*?) to match any string not including two consecutive newlines. (If there's any other way to do this, I'm all ears.)
Upvotes: 2
Views: 105
Reputation: 180391
Regex is not my strong point, but this gets all the content you want without reading the whole file into memory at once. It relies on the entries being separated by blank lines:
from itertools import groupby
import re

with open("file.txt") as f:
    # Capture the three parts of the \bibitem label for one reference code.
    r = re.compile(r"\[{(.*?)\((.*?)\)(.*?)}\]\{alfonsi2011gsa\}")
    # Blank lines separate entries: group consecutive non-blank lines together.
    for k, v in groupby(map(str.strip, f), key=bool):
        # Join with a space so words at line breaks do not run together.
        match = r.search(" ".join(v))
        if match:
            print(match.groups())

('\\textit{Alfonsi et~al.}', '2011{\\natexlab{b}}', '\\textit{Alfonsi, Spogli, Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and Mitchell}')
Upvotes: 0
Reputation: 2719
You can use a negative lookahead, (?!...), to prevent the match from containing more than one occurrence of 'bibitem'. The match then starts at the 'bibitem' that immediately precedes your reference code. This seems to work:
\\bibitem\[{(((?!bibitem).)*?)\((((?!bibitem).)*?)\)(((?!bibitem).)*?)}\]{alfonsi2011gsa}
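For anyone who wants to try the lookahead idea from Python, here is a minimal sketch rather than a definitive implementation. The helper name find_bibitem, its stop parameter, and the shortened BBL sample are made up for illustration. It builds the same kind of guarded (.*?) group, and passing stop=r"\n\n" gives the "no two consecutive newlines" variant asked about in the question, assuming the blank lines between entries are truly empty.

```python
import re

# Shortened sample with the same layout as the .bbl file in the question
# (entries separated by one empty line). Illustrative data only.
BBL = r"""\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
  De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, and C.~N. Mitchell (2011{\natexlab{a}}), Bipolar
  climatology of {GPS} ionospheric scintillation at solar minimum.

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
  Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
  Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, and C.~Mitchell (2011{\natexlab{b}}), {GPS}
  scintillation and {TEC} gradients at equatorial latitudes.
"""


def find_bibitem(text, key, stop=r"\\bibitem"):
    """Hypothetical helper: match the \\bibitem command ending in {key}.

    `stop` is a regex that no capture group is allowed to run across.
    """
    # Lazy group that advances one character at a time and refuses to
    # step onto the start of `stop`.
    chunk = rf"((?:(?!{stop}).)*?)"
    pattern = re.compile(
        r"\\bibitem\[\{" + chunk + r"\(" + chunk + r"\)" + chunk
        + r"\}\]\{" + re.escape(key) + r"\}",
        re.DOTALL,  # let "." cross the single newlines inside one command
    )
    return pattern.search(text)


match = find_bibitem(BBL, "alfonsi2011gsa")
if match:
    # Collapse the line breaks inside each captured group for display.
    print([" ".join(group.split()) for group in match.groups()])
```

re.DOTALL is needed here because the command spans several lines, and re.escape keeps any special characters in the reference code literal.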
Upvotes: 1