cmeeren

Reputation: 4210

How to match a string not containing two consecutive newlines

Demo at regex101. I have the following text file (a BibTeX .bbl file):

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
  De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, G.~De~Franceschi, V.~Romano, M.~Aquino, A.~Dodson, and
  C.~N. Mitchell (2011{\natexlab{a}}), Bipolar climatology of {GPS} ionospheric
  scintillation at solar minimum, \textit{Radio Science}, \textit{46}(3),
  \doi{10.1029/2010RS004571}.

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
  Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
  Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, J.~Tong, G.~De~Franceschi, V.~Romano, A.~Bourdillon,
  M.~Le~Huy, and C.~Mitchell (2011{\natexlab{b}}), {GPS} scintillation and
  {TEC} gradients at equatorial latitudes in april 2006, \textit{Advances in
  Space Research}, \textit{47}(10), 1750--1757,
  \doi{10.1016/j.asr.2010.04.020}.

\bibitem[{\textit{Anghel et~al.}(2008)\textit{Anghel, Astilean, Letia, and
  Komjathy}}]{anghel2008nrm}
Anghel, A., A.~Astilean, T.~Letia, and A.~Komjathy (2008), Near real-time
  monitoring of the ionosphere using dual frequency {GPS} data in a kalman
  filter approach, in \textit{{IEEE} International Conference on Automation,
  Quality and Testing, Robotics, 2008. {AQTR} 2008}, vol.~2, pp. 54--58,
  \doi{10.1109/AQTR.2008.4588793}.

\bibitem[{\textit{Baker and Wing}(1989)}]{baker1989nmc}
Baker, K.~B., and S.~Wing (1989), A new magnetic coordinate system for
  conjugate studies at high latitudes, \textit{Journal of Geophysical Research:
  Space Physics}, \textit{94}(A7), 9139--9143, \doi{10.1029/JA094iA07p09139}.

I want to match the whole \bibitem command for a single entry (with some capture groups) if I know the reference code at the end of the command. I use this regex, which works for the first entry, but not for the rest (second entry exemplified below):

\\bibitem\[{(.*?)\((.*?)\)(.*?)}\]{alfonsi2011gsa}

This doesn't work, since it matches everything from the start of the first \bibitem command to the end of the second \bibitem command. How can I match only the second \bibitem command? I have tried using a negative lookahead for ^$ and \n\n, but I couldn't get either to work - basically, I want the third (.*?) to match any string not including two consecutive newlines. (If there's any other way to do this, I'm all ears.)
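A minimal Python reproduction of the problem, as a sketch. It assumes the pattern is run with the DOTALL flag (so that . also matches newlines, which the behavior described above implies) and uses shortened stand-ins for the first two entries; the names BBL and PATTERN are only illustrative.

```python
import re

# Shortened stand-ins for the first two entries of the .bbl file.
BBL = r"""\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
  and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., and C.~N. Mitchell (2011{\natexlab{a}}), Bipolar climatology.

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
  Tong, and
  Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., and C.~Mitchell (2011{\natexlab{b}}), {GPS} scintillation.
"""

# The pattern from the question; the DOTALL flag is an assumption.
PATTERN = re.compile(
    r"\\bibitem\[{(.*?)\((.*?)\)(.*?)}\]{alfonsi2011gsa}", re.DOTALL
)

match = PATTERN.search(BBL)
print(match.start())             # offset at which the match begins
print("\n\n" in match.group(3))  # whether group 3 crossed a blank line
```

The lazy quantifier only makes each group as short as possible from a fixed starting point. The engine tries the leftmost \bibitem first, and nothing stops the third group from growing across the blank line until the reference code is found.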

Upvotes: 2

Views: 105

Answers (2)

Padraic Cunningham

Reputation: 180391

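Regex is not my strong point, but this gets the groups you want without reading the whole file into memory at once. It splits the file into blank-line-separated entries and searches each entry on its own, so a match can never span two entries: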

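from itertools import groupby
import re

# Capture the three parts of the \bibitem label for one reference key.
r = re.compile(r"\[{(.*?)\((.*?)\)(.*?)}\]\{alfonsi2011gsa\}")
with open("file.txt") as f:
    # Runs of non-blank lines form one entry; blank lines separate entries.
    for nonblank, lines in groupby(map(str.strip, f), key=bool):
        if not nonblank:
            continue
        # Join with a space so words on wrapped lines do not run together.
        match = r.search(" ".join(lines))
        if match:
            print(match.groups())

('\\textit{Alfonsi et~al.}', '2011{\\natexlab{b}}', '\\textit{Alfonsi, Spogli, Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and Mitchell}')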

Upvotes: 0

Honza Osobne

Reputation: 2719

You can use a negative lookahead, (?!...), in front of every character the groups consume, so that no part of the match can contain another 'bibitem'. The match then has to start at the 'bibitem' that immediately precedes your reference code. This seems to work:

\\bibitem\[{(((?!bibitem).)*?)\((((?!bibitem).)*?)\)(((?!bibitem).)*?)}\]{alfonsi2011gsa}
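The same lookahead technique can also be written in the form the question asks for, where each part may contain anything except two consecutive newlines. The following Python version is a sketch, not a tested drop-in: it assumes the DOTALL flag, uses shortened stand-ins for two entries, and the helper name bibitem_pattern is hypothetical. Non-capturing groups, (?:...), keep the three parts as groups 1 to 3, whereas in the pattern above the inner parentheses also capture, so the parts end up in groups 1, 3 and 5.

```python
import re

# Shortened stand-ins for the first two entries of the .bbl file.
BBL = r"""\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
  and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., and C.~N. Mitchell (2011{\natexlab{a}}), Bipolar climatology.

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
  Tong, and
  Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., and C.~Mitchell (2011{\natexlab{b}}), {GPS} scintillation.
"""


def bibitem_pattern(key):
    """Build a pattern for the bibitem label of `key` (hypothetical helper)."""
    # Consume one character at a time, but never step onto a blank line:
    # (?!\n\n) fails wherever two consecutive newlines start.
    part = r"((?:(?!\n\n).)*?)"
    return re.compile(
        r"\\bibitem\[\{" + part + r"\(" + part + r"\)" + part
        + r"\}\]\{" + re.escape(key) + r"\}",
        re.DOTALL,  # assumption: "." must match newlines inside one entry
    )


match = bibitem_pattern("alfonsi2011gsa").search(BBL)
print(match.groups())
```

Because no group can cross a blank line, the attempt that starts at the first \bibitem fails and the engine moves on to the entry that actually ends in the requested key.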

Upvotes: 1
