aee

Reputation: 573

AWK print all regex matches on every line

I have the following text input:

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

As seen in the text, the number of appearances of <?> is not fixed: it can appear zero or more times on the same line.

Only using awk I need to output this:

<a> <b> <c>
<d> <e>
<f>

I tried this awk script:

awk '{
  match($0,/<[^>]+>/,a);           # fill array a with matches
  for (i in a) {
    if (match(i, /^[0-9]+$/) != 0) # ignore non numeric indices
      print a[i]
  }
}' somefile.txt

but this only outputs the first match on every line:

<a>
<d>
<f>

Is there some way of doing this with match() or any other built-in function?

Upvotes: 18

Views: 1295

Answers (9)

RARE Kpop Manifesto

Reputation: 2915

If you really want to do it the patsplit() way, here's how to emulate that effect in other awks:

echo 'lorem <a> ipsum <b> dolor <c> sit amet,
       consectetur <d> adipiscing elit <e>, sed
       do eiusmod <f> tempor
       incididunt ut' | 
awk '
BEGIN { 
    RS = "^$" 
}   _ = gsub(/[<][^>]*[>]/, "\4&\5") {

     split($!_, __, /((^|\5)[^\4]*)\4|\5[^\4]*$/)

     for (_ in __)
         print _, __[_]
}' 
1 
2 <a>
3 <b>
4 <c>
5 <d>
6 <e>
7 <f>
8 

Upvotes: 0

RARE Kpop Manifesto

Reputation: 2915

INPUT

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

CODE

mawk -F'^[^<]+|[^>]+$' 'gsub(">[^<]*<","> <",$!(NF=NF))^_*/./' OFS=

OUTPUT

<a> <b> <c>
<d> <e>
<f>

Upvotes: 5

Fravadona

Reputation: 17290

Here's a simple awk solution based on regexps:

awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF'

edit: using NF instead of $0 != ""; thanks @EdMorton

For each line:

  • strip all chars from the left up to the first < (excluded) or up to the end-of-line when < isn't found.
  • strip all chars from the right up to the first > (excluded) or up to the start-of-line when > isn't found.
  • replace what's between each > and < pair with a space character.
  • print the result when it isn't empty
example
lorem <a a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
<g>incididunt ut<h><i>
h>ell<o
<j>
output
<a a> <b> <c>
<d> <e>
<f>
<g> <h> <i>
<j>
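
To see what each of the steps listed above does to a line, here is a minimal tracing sketch. It is not part of the one-liner: it splits the combined gsub() into two sub() calls so the intermediate results can be printed, and trace_steps is a hypothetical helper name.

```shell
# Sketch only: trace the three stripping steps on each input line (POSIX awk).
# trace_steps is a hypothetical helper name, not something from the answer.
trace_steps() {
  awk '{
    print "input : [" $0 "]"
    sub(/^[^<]*/, "");       print "step 1: [" $0 "]"   # drop text before the first <
    sub(/[^>]*$/, "");       print "step 2: [" $0 "]"   # drop text after the last >
    gsub(/>[^<]*</, "> <");  print "step 3: [" $0 "]"   # collapse text between tags to one space
  }'
}

printf '%s\n' 'lorem <a a> ipsum <b> dolor <c> sit amet,' | trace_steps
```

On the sample line the last line printed is `step 3: [<a a> <b> <c>]`; a line such as `h>ell<o` ends up empty after step 2, which is why the final NF test suppresses it.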

Remark: With exactly the same logic you can use sed:

sed 's/^[^<]*//; s/[^>]*$//; s/>[^<]*</> </g; /^$/d'

Upvotes: 9

The fourth bird

Reputation: 163632

Another option is to use GNU awk with gensub(). The pattern captures each <...> tag in group 1, consuming any spaces around it, and matches the remaining text without capturing it.

In the replacement use group 1 surrounded with a single space.

awk '{$0 = gensub(/ *(<[^>]*>) *|[^<>]+/, " \\1 ", "g"); $1=$1}1' file

Output

<a> <b> <c>
<d> <e>
<f>

Upvotes: 4

anubhava

Reputation: 786091

Here is a simple gnu-awk alternative solution using patsplit:

awk '
n = patsplit($0, m, /<[^>]+>/) {
   for (i=1; i<=n; ++i)
      printf "%s", m[i] (i < n ? OFS : ORS)
}' file

<a> <b> <c>
<d> <e>
<f>

Upvotes: 8

Daweo

Reputation: 36765

I would use GNU AWK for this task in the following way. Let the content of file.txt be

lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed 
do eiusmod <f> tempor
incididunt ut

then

awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt

gives output

<a> <b> <c>
<d> <e>
<f>

Explanation: I tell GNU AWK that a field is < followed by zero or more (*) non-> characters ([^>]) followed by >. For each line I do $1=$1 to force the record to be rebuilt, so the line becomes the found fields joined by a space (the default OFS), which I then print.

(tested in gawk 4.2.1)

Upvotes: 9

glenn jackman

Reputation: 247200

Assuming there are no stray angle brackets, use either < or > as a field separator and print every second field:

awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data
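
A quick way to see why the tags land in the even-numbered fields, and why a stray bracket breaks that, is to print every field with its index. This is only an illustrative sketch; show_fields is a hypothetical helper name.

```shell
# Sketch only: show how -F'[<>]' splits a line, one "index:[field]" pair per field.
# show_fields is a hypothetical helper name used for illustration.
show_fields() {
  awk -F'[<>]' '{
    for (i = 1; i <= NF; i++)
      printf "%d:[%s]%s", i, $i, (i < NF ? " " : "\n")
  }'
}

printf '%s\n' 'lorem <a> ipsum <b> dolor' | show_fields
printf '%s\n' 'x > y <a>' | show_fields
```

The first line prints `1:[lorem ] 2:[a] 3:[ ipsum ] 4:[b] 5:[ dolor]`, so fields 2 and 4 hold the tag contents. The second prints `1:[x ] 2:[ y ] 3:[a] 4:[]`: the stray > shifts the tag to an odd index, which is the case the "no stray angle brackets" assumption rules out.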

Upvotes: 11

RavinderSingh13

Reputation: 133760

With GNU awk you could use its built-in variable FPAT; try the following awk code.

awk -v FPAT='<[^>]*>' '
NF{
  val=""
  for(i=1;i<=NF;i++){
    val=(val?val OFS:"") $i
  }
  print val
}
'  Input_file

Upvotes: 16

markp-fuso

Reputation: 35366

match() doesn't work the way you think it does: each call finds only the first (leftmost) match. To find a variable number of matches you need to match() the first occurrence, strip everything up to and including it, match() the remainder of the input for the next occurrence, and repeat until the current line has no more matches, e.g.:

awk '
{ out=sep=""                                     # init variables for new line
  while (match($0,/<[^>]+>/)) {                  # find 1st match
        out=out sep substr($0,RSTART,RLENGTH)    # build up output line
        $0=substr($0,RSTART+RLENGTH)             # strip off 1st match and prep for next while() check
        sep=OFS                                  # set field separator for follow-on matches
  }
  if (out) print out
}' somefile.txt
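
If overwriting $0 inside the loop is undesirable, the same match-and-strip idea can be wrapped in a user-defined function that works on a copy of the line. The following is a sketch under that assumption, in POSIX awk; findall and extract_tags are hypothetical names.

```shell
# Sketch only: the match()/substr() loop as a reusable function that leaves $0 intact.
# findall and extract_tags are hypothetical names used for illustration.
extract_tags() {
  awk '
  # Collect every match of re in str into arr[1..n] and return n.
  # Assumes re never matches the empty string (otherwise the loop would not end).
  function findall(str, re, arr,    n) {
    n = 0
    while (match(str, re)) {
      arr[++n] = substr(str, RSTART, RLENGTH)   # save the current match
      str = substr(str, RSTART + RLENGTH)       # continue after it (local copy only)
    }
    return n
  }
  {
    n = findall($0, "<[^>]+>", m)
    for (i = 1; i <= n; i++)                    # prints nothing when n is 0
      printf "%s%s", m[i], (i < n ? OFS : ORS)
  }'
}

printf '%s\n' 'lorem <a> ipsum <b> dolor <c> sit amet,' 'incididunt ut' | extract_tags
```

Because str is a function parameter, only the local copy is shortened, so $0 and the fields remain available to later rules.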

Another idea uses the split() function, eg:

awk '
{ n=split($0,a,/[<>]/)                           # split line on dual delimiters "<" and ">"
  out=sep=""
  for (i=2;i<=n;i=i+2) {                         # step through even numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
      out=out sep "<" a[i] ">"                   # build output line
      sep=OFS 
  }
  if (out) print out
}
' somefile.txt

Both of these generate:

<a> <b> <c>
<d> <e>
<f>

Upvotes: 10
