Reputation: 573
I have the following text input:
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
As seen in the text, the number of <?> occurrences is not fixed: a line can contain zero, one, or several of them.
Using only awk, I need to output this:
<a> <b> <c>
<d> <e>
<f>
I tried this awk script:
awk '{
    match($0, /<[^>]+>/, a)    # fill array a with matches
    for (i in a) {
        if (match(i, /^[0-9]+$/) != 0)    # ignore non-numeric indices
            print a[i]
    }
}' somefile.txt
but this only outputs the first match on every line:
<a>
<d>
<f>
Is there some way of doing this with match() or any other built-in function?
Upvotes: 18
Views: 1295
Reputation: 2915
If you really want to do it the patsplit() way, here's how to emulate that effect in other awks:
echo 'lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut' |
awk '
BEGIN {
RS = "^$"
} _ = gsub(/[<][^>]*[>]/, "\4&\5") {
split($!_, __, /((^|\5)[^\4]*)\4|\5[^\4]*$/)
for (_ in __)
print _, __[_]
}'
1
2 <a>
3 <b>
4 <c>
5 <d>
6 <e>
7 <f>
8
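For readers who find the one-liner above hard to follow, the same idea (collect every match on a line into an array, much as gawk's patsplit() does) can be sketched in portable awk with a match() loop. The helper name findall, its parameters, and the wrapper function extract_tags are hypothetical names chosen for illustration. This is a minimal sketch, not the answer author's method.

```shell
extract_tags() {
  awk '
  # findall (hypothetical helper): store every match of re in arr[1..n], return n
  function findall(str, re, arr,    n) {
    n = 0
    while (match(str, re)) {
      arr[++n] = substr(str, RSTART, RLENGTH)
      str = substr(str, RSTART + RLENGTH)   # continue scanning after this match
    }
    return n
  }
  {
    n = findall($0, "<[^>]+>", m)
    # lines without any match give n == 0 and print nothing
    for (i = 1; i <= n; i++)
      printf "%s%s", m[i], (i < n ? OFS : ORS)
  }' "$@"
}

printf '%s\n' 'lorem <a> ipsum <b> dolor <c> sit amet,' 'incididunt ut' 'do eiusmod <f> tempor' | extract_tags
```

Unlike the version above, this works record by record and does not depend on the input being free of the \4 and \5 control characters.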
Upvotes: 0
Reputation: 2915
INPUT
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
CODE
mawk -F'^[^<]+|[^>]+$' 'gsub(">[^<]*<","> <",$!(NF=NF))^_*/./' OFS=
OUTPUT
<a> <b> <c>
<d> <e>
<f>
Upvotes: 5
Reputation: 17290
Here's a simple awk solution based on regexps:
awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF'
edit: using NF instead of $0 != ""; thanks @EdMorton
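For each line:
1. Strip the leading characters up to the first < (excluded), or up to the end-of-line when < isn't found.
2. Strip the trailing characters back to the last > (excluded), or back to the start-of-line when > isn't found.
3. Replace the characters between each > and < pair with a space character.
Example input:
lorem <a a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
<g>incididunt ut<h><i>
h>ell<o
<j>
Output:
<a a> <b> <c>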
<d> <e>
<f>
<g> <h> <i>
<j>
Remark: With exactly the same logic you can use sed:
sed 's/^[^<]*//; s/[^>]*$//; s/>[^<]*</> </g; /^$/d'
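To see how the two gsub() calls in the awk version interact, the following sketch prints the record after each call. The wrapper name trace_steps and the bracketed trace format are made up for illustration; the regexps are the ones from the answer.

```shell
trace_steps() {
  awk '{
    printf "input : [%s]\n", $0
    gsub(/^[^<]*|[^>]*$/, "")   # trim everything before the first < and after the last >
    printf "trim  : [%s]\n", $0
    gsub(/>[^<]*</, "> <")      # collapse the text between two tags to one space
    printf "join  : [%s]\n", $0
  }'
}

printf '%s\n' 'lorem <a a> ipsum <b> dolor' 'h>ell<o' | trace_steps
```

For a line such as h>ell<o, the first gsub() already empties the record, which is why the NF pattern in the answer skips it.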
Upvotes: 9
Reputation: 163632
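Another option is to use GNU awk with gensub(). You can capture each angle-bracket tag together with any surrounding spaces, and match everything else without capturing it.
In the replacement, use group 1 with a single space on each side; $1=$1 then rebuilds the record and squeezes the extra spaces.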
awk '{$0 = gensub(/ *(<[^>]*>) *|[^<>]+/, " \\1 ", "g"); $1=$1}1' file
Output
<a> <b> <c>
<d> <e>
<f>
Upvotes: 4
Reputation: 786091
Here is a simple gnu-awk alternative solution using patsplit:
awk '
n = patsplit($0, m, /<[^>]+>/) {
    for (i=1; i<=n; ++i)
        printf "%s", m[i] (i < n ? OFS : ORS)
}' file
<a> <b> <c>
<d> <e>
<f>
Upvotes: 8
Reputation: 36765
I would use GNU AWK for this task in the following way. Let the content of file.txt be
lorem <a> ipsum <b> dolor <c> sit amet,
consectetur <d> adipiscing elit <e>, sed
do eiusmod <f> tempor
incididunt ut
then
awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt
gives output
<a> <b> <c>
<d> <e>
<f>
Explanation: I inform GNU AWK that a field is < followed by zero or more (*) non-> characters ([^>]) followed by >. For each line I do $1=$1 to force the record to be rebuilt, so the line becomes the found fields joined by a space, which I then print.
(tested in gawk 4.2.1)
Upvotes: 9
Reputation: 247200
Assuming there are no stray angle brackets, use either < or > as a field separator and print every second field:
awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data
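Note that the one-liner above leaves a trailing space after the last tag and prints an empty line for records that contain no tags. A minimal variant that avoids both, under the same assumption of no stray angle brackets, might look like this (the wrapper name tags_by_fs and the variable line are hypothetical):

```shell
tags_by_fs() {
  awk -F'[<>]' 'NF > 1 {               # NF is 1 when the line has no < or >
    line = ""
    for (i = 2; i <= NF; i += 2)       # even-numbered fields are the tag names
      line = line (i > 2 ? OFS : "") "<" $i ">"
    print line
  }' "$@"
}

printf '%s\n' 'lorem <a> ipsum <b> dolor <c> sit amet,' 'incididunt ut' 'do eiusmod <f> tempor' | tags_by_fs
```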
Upvotes: 11
Reputation: 133760
With GNU awk you could use its built-in variable FPAT; try the following awk code.
awk -v FPAT='<[^>]*>' '
NF{
  val=""
  for(i=1;i<=NF;i++){
    val=(val?val OFS:"") $i
  }
  print val
}
' Input_file
Upvotes: 16
Reputation: 35366
match() doesn't work the way you think it does: it only finds the first match in the string. To find a variable number of matches you need to match() the first occurrence, strip everything up to and including it, then match() the remainder of the input for the next occurrence, and repeat until there are no more matches in the current line, e.g.:
awk '
{ out=sep=""                                  # init variables for new line
  while (match($0,/<[^>]+>/)) {               # find 1st match
      out=out sep substr($0,RSTART,RLENGTH)   # build up output line
      $0=substr($0,RSTART+RLENGTH)            # strip off 1st match and prep for next while() check
      sep=OFS                                 # set field separator for follow-on matches
  }
  if (out) print out
}' somefile.txt
Another idea uses the split() function, e.g.:
awk '
{ n=split($0,a,/[<>]/)           # split line on dual delimiters "<" and ">"
  out=sep=""
  for (i=2;i<=n;i=i+2) {         # step through even numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
      out=out sep "<" a[i] ">"   # build output line
      sep=OFS
  }
  if (out) print out
}
' somefile.txt
Both of these generate:
<a> <b> <c>
<d> <e>
<f>
Upvotes: 10