Pedro

Reputation: 125

Get list of strings between certain strings in bash

Given a text file (.tex) which may contain strings of the form "\cite{alice}", "\cite{bob}", and so on, I would like to write a bash script that stores the content within the braces of each such string ("alice" and "bob") in a new text file (say, .txt). In the output file I would like to have one line for each such content, and I would also like to avoid repetitions.

Attempts:

Upvotes: 0

Views: 83

Answers (2)

tripleee

Reputation: 189337

You can use grep -o and postprocess its output:

grep -o '\\cite{[^{}]*}' file.tex |
sed 's/\\cite{\([^{}]*\)}/\1/'

If there can only ever be a single \cite on an input line, just a sed script suffices.

sed -n 's/.*\\cite{\([^{}]*\)}.*/\1/p' file.tex

(It's by no means impossible to refactor this into a script which extracts multiple occurrences per line; but good luck understanding your code six weeks from now.)

As usual, add sort -u to remove any repetitions.
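Put together, the whole pipeline might look like the sketch below. The sample input and the temporary directory are assumptions made only so the example runs on its own; in practice you would point it at your existing file.tex and choose your own output name.

```shell
#!/usr/bin/env bash
# Hypothetical sample input, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' 'See \cite{alice} and \cite{bob}.' 'Again \cite{alice}.' > "$tmpdir/file.tex"

# 1. grep -o prints every \cite{...} occurrence on its own line.
# 2. sed strips the \cite{ } wrapper, keeping only the key.
# 3. sort -u sorts the keys and drops repetitions.
grep -o '\\cite{[^{}]*}' "$tmpdir/file.tex" |
sed 's/\\cite{\([^{}]*\)}/\1/' |
sort -u > "$tmpdir/cites.txt"

cat "$tmpdir/cites.txt"
```

With this sample input, cites.txt ends up with two lines, alice and bob.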

Here's a brief Awk attempt:

awk -v RS='\\' '/^cite\{/ {
    split($0, g, /[{}]/)
    cite[g[2]]++ }
  END { for (cit in cite) print cit }' file.tex

This conveniently does not print any duplicates, and trivially handles multiple citations per line.
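One caveat: the order in which for (cit in cite) visits the array is unspecified in Awk, so the keys may come out in any order. If you want a stable, sorted file, a sketch like the following pipes the result through sort. The sample input and temporary directory are assumptions for illustration, and RS is written as '\\' here, which Awk reads as a single backslash.

```shell
#!/usr/bin/env bash
# Hypothetical sample input with a repeated key and several citations on one line.
tmpdir=$(mktemp -d)
printf '%s\n' 'See \cite{bob} and \cite{alice}, then \cite{bob}.' > "$tmpdir/file.tex"

# Split records on backslashes; any record starting with "cite{" is a citation.
# The array keys are already unique, and sort only fixes the output order.
awk -v RS='\\' '/^cite\{/ {
    split($0, g, /[{}]/)
    cite[g[2]]++ }
  END { for (cit in cite) print cit }' "$tmpdir/file.tex" |
sort > "$tmpdir/cites.txt"

cat "$tmpdir/cites.txt"
```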

Upvotes: 2

Amessihel

Reputation: 6374

What about:

grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u > cites.txt
  • -P with GNU grep interprets the regexp as a Perl-compatible one (for lookbehind and lookahead groups)
  • -o "prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line" (see manual)
  • The regexp matches text containing no right curly brace, preceded by \cite{ (positive lookbehind group (?<=\\cite{)) and followed by a right curly brace (positive lookahead group (?=})).
  • sort -u sorts and removes duplicates

For more details about lookahead and lookbehind groups, see the dedicated page on Regular-Expressions.info.
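To see what the lookaround groups do, here is a small sketch run against a made-up sample file (the file contents and the temporary directory are assumptions, and -P requires a grep built with PCRE support, such as GNU grep):

```shell
#!/usr/bin/env bash
# Hypothetical sample input, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' 'As shown in \cite{alice} and \cite{bob}; see also \cite{alice}.' > "$tmpdir/sample.tex"

# Raw matches: the lookbehind and lookahead must match, but are not part
# of the printed text, so only the keys appear, one per line.
grep -oP '(?<=\\cite{)[^}]+(?=})' "$tmpdir/sample.tex" > "$tmpdir/raw.txt"

# Same command with sort -u appended to drop the repeated key.
grep -oP '(?<=\\cite{)[^}]+(?=})' "$tmpdir/sample.tex" | sort -u > "$tmpdir/cites.txt"
```

Here raw.txt holds alice, bob, alice (one per line), while cites.txt holds only alice and bob.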

Upvotes: 2
