Shearn

Reputation: 43

fetch text between multiple strings on the same line

I would like to use bash to extract the text that lies between two strings in a file. There are already some answers to this, e.g.:

Print text between two strings on the same line

But I would like to do this for multiple occurrences, sometimes on the same line, sometimes on separate lines. For example, starting with a file like this:

\section{The rock outcrop pools experimental system} \label{intro:rockpools}
contain pools at their summit \parencite{brendonck_pools_2010} that have weathered into the rock over time \parencite{bayly_aquatic_2011} through chemical weathering after water collecting at the rock surface \parencite{lister_microgeomorphology_1973}.
Classification depends on dimensions \parencite{twidale_gnammas_1963}.

I would like to retrieve:

brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

I imagine sed should be able to do this, but I'm not sure where to start.

Upvotes: 0

Views: 946

Answers (3)

potong

Reputation: 58351

This might work for you (GNU sed):

sed '/\\parencite{\([^}]*\)}/!d;s//\n\1\n/;s/^[^\n]*\n//;P;D' file

Delete any line that doesn't contain the required string. Otherwise, surround the first occurrence with newlines and remove everything up to and including the first newline. Print up to and including the following newline, then delete what was printed and repeat on the remainder of the line.
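The steps described above can be laid out one command per line, which may make the loop easier to follow. This is only a sketch, assuming GNU sed (it relies on \n in the replacement and inside a bracket expression); the function name extract_keys and the sample keys are made up for illustration.

```shell
# Hypothetical wrapper around the one-liner above; reads from stdin.
extract_keys() {
  sed '
    # Drop any line (or leftover) that has no \parencite{...}
    /\\parencite{\([^}]*\)}/!d
    # Empty regex reuses the address regex: wrap the first key in newlines
    s//\n\1\n/
    # Remove everything up to and including the first newline
    s/^[^\n]*\n//
    # Print up to the first newline, i.e. the key itself
    P
    # Delete what was printed and restart the cycle on the remainder
    D
  '
}

# Made-up sample input: two keys on one line, a line without a key, one more key
printf '%s\n' \
  'pools \parencite{a_2010} weathered \parencite{b_2011}.' \
  'no citation here' \
  'dimensions \parencite{c_1963}.' | extract_keys
```

With GNU sed this should print a_2010, b_2011 and c_1963, one per line. The D command is what makes the loop work: it restarts the script on whatever is left in the pattern space instead of reading the next input line.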

Upvotes: 0

anubhava

Reputation: 784898

Using grep -oP (GNU grep with PCRE support):

grep -oP '\\parencite\{\K[^}]+' file
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Or using GNU awk (FPAT is a gawk extension):

awk -v FPAT='\\\\parencite{[^}]+' '{for (i=1; i<=NF; i++) {
    sub(/\\parencite{/, "", $i); print $i}}' file
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Upvotes: 1

karakfa

Reputation: 67467

This two-stage extraction might be easier to understand, and it doesn't need Perl regex support.

$ grep -o "parencite{[^}]*}" cite | sed 's/parencite{//;s/}//'
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963
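To see what each stage contributes, the pipeline can be split and run step by step. A small sketch; the function names stage1 and stage2 and the sample line are assumptions for illustration only.

```shell
# Stage 1: put each parencite{...} match on its own line
stage1() { grep -o 'parencite{[^}]*}'; }

# Stage 2: strip the "parencite{" prefix and the closing "}"
stage2() { sed 's/parencite{//;s/}//'; }

# Made-up sample line with two citations
line='pools \parencite{a_2010} weathered \parencite{b_2011}.'

# After stage 1 only: the matches still carry their wrapper
printf '%s\n' "$line" | stage1

# After both stages: just the keys
printf '%s\n' "$line" | stage1 | stage2
```

The first command should print parencite{a_2010} and parencite{b_2011}; the second should print a_2010 and b_2011.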

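Or, as always, awk to the rescue! This splits records on spaces, so it assumes each space-separated chunk contains at most one {...} group: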

$ awk -F'[{}]' -v RS=" " '/parencite/{print $2}' cite
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Upvotes: 1
