Himangshu Paul

Reputation: 123

sed to get string between two patterns

I am working on a LaTeX file from which I need to extract the references marked by \citep{}. This is what I am doing with sed:

    cat file.tex | grep citep | sed 's/.*citep{\(.*\)}.*/\1/g'

This works if a line contains only one \citep. If a line contains more than one \citep, it fails. It also fails when there is only one \citep but more than one closing brace } on the line. What should I do so that it handles every \citep in a line and stops at the closing brace that belongs to it?

I am working in bash, and part of the file looks like this:

of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and 
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic 
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}. 
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple 
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction 
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh 
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography, 
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study 
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of 

For one line, I get output like this:

    BilhamE01, TapponnierM76} and by distributed seismicity across the region (Fig. \ref{fig1_2

whereas I am looking for

    BilhamE01, TapponnierM76

Another example, with more than one \citep-style command in the line, gives output like this:

    Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study

whereas I am looking for

    Pauletal2015 Mitraetal2005

Can anyone please help?

Upvotes: 1

Views: 191

Answers (4)

slitvinov

Reputation: 5768

f.awk

BEGIN {
    pat = "\\citep"
    latex_tok = "\\\\[A-Za-z_][A-Za-z_]*" # match \aBcD
}

{
    f = f $0 # store content of input file as a string
}

function store(args,   n, k, i) { # store `keys' in `d'
    gsub("[ \t]", "", args) # remove spaces
    n = split(args, keys, ",")
    for (i=1; i<=n; i++) {
      k = keys[i]
      d[k]
    }
}

function ntok() { # next token
    if (match(f, latex_tok)) {
      tok = substr(f, RSTART          ,RLENGTH)
      f   = substr(f, RSTART+RLENGTH          ) # drop the whole token
      return 1
    }
    return 0
}

function parse(    i, rc, args) {
    for (;;) { # infinite loop
      while ( (rc = ntok()) && tok != pat ) ;
      if (!rc) return

      i = index(f, "{")
      if (!i) return # see `pat' but no '{'
      f = substr(f, i+1)

      i = index(f, "}")
      if (!i) return # '{' without a closing '}'

      # extract `args' from \citep{`args'}
      args = substr(f, 1, i-1)
      store(args)
    }
}

END {
    parse()
    for (k in d)
      print k
}

f.example

of the Asian crust further north \citep{TapponnierM76, WangLiu2009}. This has led to widespread deformation both within and 
\citep{BilhamE01, Mitraetal2005} and by distributed seismicity across the region (Fig. \ref{fig1_2}). Recent GPS Geodetic 
across the Dawki fault and Naga Hills, increasing eastwards from $\sim$3~mm/yr to $\sim$13~mm/yr \citep{Vernantetal2014}. 
GPS velocity vectors \citep{TapponnierM76, WangLiu2009}. Sikkim Himalaya lies at the transition between this relatively simple 
this transition includes deviation of the Himalaya from a perfect arc beyond 89\deg\ longitude \citep{BendickB2001}, reduction 
\citep{BhattacharyaM2009, Mitraetal2010}. Rivers Tista, Rangit and Rangli run through Sikkim eroding the MCT and Ramgarh 
thrust to form a mushroom-shaped physiography \citep{Mukuletal2009,Mitraetal2010}. Within this sinuous physiography, 
\citep{Pauletal2015} and also in accordance with the findings of \citet{Mitraetal2005} for northeast India. In another study 
field results corroborate well with seismic studies in this region \citep{Actonetal2011, Arunetal2010}. From studies of

Usage:

awk -f f.awk f.example

Expected output (the order may vary, since awk's for-in loop visits array keys in an unspecified order):

BendickB2001
Arunetal2010
Pauletal2015
Mitraetal2005
BilhamE01
Mukuletal2009
TapponnierM76
WangLiu2009
BhattacharyaM2009
Mitraetal2010
Actonetal2011
Vernantetal2014

Upvotes: 0

John Bollinger

Reputation: 181932

For what it's worth, this can be done with sed:

echo "\citep{string} xyz {abc} \citep{string2},foo" | \
  sed 's/\\citep{\([^}]*\)}/\n\1\n\n/g; s/^[^\n]*\n//; s/\n\n[^\n]*\n/, /g; s/\n.*//g'

output:

string, string2

But wow, is that ugly. The sed script is more easily understood in this form, which happens to be suitable to be fed to sed via a -f argument:

# change every \citep{string} to <newline>string<newline><newline>
s/\\citep{\([^}]*\)}/\n\1\n\n/g

# remove any leading text before the first wanted string
s/^[^\n]*\n//

# replace text between wanted strings with comma + space
s/\n\n[^\n]*\n/, /g

# remove any trailing unwanted text
s/\n.*//

This makes use of the fact that sed can match and substitute the newline character, even though reading a new line of input does not put a newline in the pattern space initially. The newline is the one character that we can be certain will appear in the pattern space (or in the hold space) only if sed puts it there intentionally.

The initial substitution is purely to make the problem manageable by simplifying the target delimiters. In principle, the remaining steps could be performed without that simplification, but the regular expressions involved would be horrendous.

This does assume that the string in every \citep{string} contains at least one character; if the empty string must be accommodated, too, then this approach needs a bit more refinement.

Of course, I can't imagine why anyone would prefer this to @Lev's straight grep approach, but the question does ask specifically for a sed solution.
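
For illustration, here is a sketch of how the four commands above might be applied to several lines at once. It assumes GNU sed, since the script relies on \n being understood in the replacement text and inside bracket expressions. The names citep_sed and citep_per_line are made up for this example. Lines without any \citep pass through the script unchanged, so the sketch drops them with grep first.

```shell
# The four commands from the commented script, held in a shell variable
# (hypothetical name) instead of a file passed with -f.
citep_sed='
s/\\citep{\([^}]*\)}/\n\1\n\n/g
s/^[^\n]*\n//
s/\n\n[^\n]*\n/, /g
s/\n.*//
'

# Hypothetical wrapper: keep only lines containing \citep{, then run the script.
citep_per_line() {
    grep -F '\citep{' | sed "$citep_sed"
}

# Prints one comma-separated list per input line that has a \citep.
printf '%s\n' 'a \citep{X, Y} b {c} \citep{Z}, d' 'no citation here' | citep_per_line
```

With the sample input in the last line, this prints a single line, X, Y, Z, and the second input line is discarded by grep.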

Upvotes: 1

karakfa

Reputation: 67567

The .* in your capture group is greedy, so it runs to the last closing brace on the line. Change the regex so the group stops at the first closing brace:

.*citep{\([^}]*\)}

test

$ echo "\citep{string} xyz {abc}" |  sed 's/.*citep{\([^}]*\)}.*/\1/'
string

Note that it will only match one instance per line.
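
One possible way around the one-instance-per-line limit, sketched here on the assumption that your grep supports -o (GNU and BSD grep do): let grep print each \citep{...} on a line of its own, then apply the same substitution to each of those lines. The function name citep_keys is only for illustration.

```shell
# Hypothetical helper: print the contents of every \citep{...} found on stdin,
# one brace group per output line.
citep_keys() {
    # grep -o emits each match on its own line, so sed never sees
    # more than one \citep per line.
    grep -o '\\citep{[^}]*}' | sed 's/.*citep{\([^}]*\)}.*/\1/'
}

printf '%s\n' '\citep{a, b} xyz {abc} \citep{c}' | citep_keys
```

For the sample input this prints "a, b" on the first line and "c" on the second. Commands such as \citet are ignored because the grep pattern names \citep explicitly.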

Upvotes: 3

Lev Levitsky

Reputation: 65871

If you are using grep anyway, you may as well stick with it (assuming GNU grep, which provides -P):

$ echo "$str" | grep -oP '(?<=\\citep{)[^}]+(?=})'
BilhamE01, TapponierM76
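
As a sketch of how this could be extended to a whole file: the desired output in the question for the last example also includes a key cited with \citet, so the variant below accepts both \citep and \citet, then splits the comma-separated groups into individual keys and removes duplicates. It assumes a GNU grep built with -P support; the function name cite_keys is made up for this example.

```shell
# Hypothetical helper: list each distinct key cited via \citep or \citet on stdin.
# Steps: extract the text between the braces, put one key per line,
# trim surrounding spaces, then sort and drop duplicates.
cite_keys() {
    grep -oP '(?<=\\cite[pt]\{)[^}]+(?=\})' |
        tr ',' '\n' |
        sed 's/^ *//; s/ *$//' |
        sort -u
}

printf '%s\n' '\citep{Pauletal2015} and \citet{Mitraetal2005}; \citep{Mitraetal2005, Arunetal2010}' | cite_keys
```

On a real file this would be run as cite_keys < file.tex. For the sample line above it prints Arunetal2010, Mitraetal2005 and Pauletal2015, one per line.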

Upvotes: 2
