Get list of strings between certain strings in bash

Question

Given a text file (.tex) which may contain strings of the form "\cite{alice}", "\cite{bob}", and so on, I would like to write a bash script that stores the content within brackets of each such string ("alice" and "bob") in a new text file (say, .txt). In the output file I would like to have one line for each such content, and I would also like to avoid repetitions.

Attempts:

I thought about combining grep and cut. From other questions and answers that I have seen on Stack Exchange, I think that (modulo reading up on cut a bit more) I could manage to get at least one such content per line, but I do not know how to get all occurrences in a single line if there are several such strings in it, and I have not seen any question or answer giving hints in this direction.
I have tried using sed as well. Yesterday I read this guide to see if I was missing some basic sed command, but I did not see any straightforward way to do what I want (the guide did mention that sed is Turing complete, so I am sure there is a way to do this only with sed, but I do not see how).

Amessihel · Accepted Answer

What about:

grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u > cites.txt

-P makes GNU grep interpret the regexp as a Perl-compatible one (needed for lookbehind and lookahead groups)
-o "prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line" (see manual)
The regexp matches a curly-brace-free text preceded by \cite{ (positive lookbehind group (?<=\\cite{)) and followed by a right curly brace (positive lookahead group (?=})). The backslash is doubled inside the regexp because a single \c would start a control-character escape in a Perl-compatible regexp.
sort -u sorts and removes duplicates
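
To see how the pieces fit together, here is a minimal sketch that builds a small test file and runs the pipeline on it. The file contents are made up for illustration (two citations on one line and one repeated key), and it assumes a GNU grep built with -P support.

```shell
# Hypothetical sample input: two \cite commands on the first line,
# and a key (alice) that is repeated on the second line.
printf '%s\n' \
  'See \cite{alice} and \cite{bob} for details.' \
  'As noted in \cite{alice}, this holds.' > sample.tex

# -o prints each match on its own line, so several citations on one
# input line are all captured; sort -u then drops the repeated key.
grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u > cites.txt

cat cites.txt
```

With this input, cites.txt should contain alice and bob, each on its own line.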

For more details about lookahead and lookbehind groups, see the dedicated page on Regular-Expressions.info.
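
If -P is not available (it is a GNU grep extension and may be missing from other grep builds), one possible alternative, not part of the original answer, is to match the whole \cite{...} string with an extended regexp and strip the wrapper with sed afterwards. This is only a sketch: the sample file contents and the output name cites_portable.txt are made up for illustration, and it assumes a grep that supports -o.

```shell
# Hypothetical sample input, same shape as in the question.
printf '%s\n' \
  'See \cite{alice} and \cite{bob} for details.' \
  'As noted in \cite{alice}, this holds.' > sample.tex

# Step 1: grep -oE prints every full \cite{...} match on its own line.
# Step 2: sed removes the leading \cite{ and the trailing }.
# Step 3: sort -u sorts the keys and removes duplicates.
grep -oE '\\cite\{[^}]+\}' sample.tex \
  | sed 's/^\\cite{//; s/}$//' \
  | sort -u > cites_portable.txt

cat cites_portable.txt
```

The lookaround version keeps everything in one regexp, while this variant trades that for not depending on Perl-compatible regexps.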
