Reputation: 5919
To make this problem simple to demonstrate, I made a fake xml file like this.
<abc>
<spirit:addressBlock>
<spirit:name>cmn700_registers</spirit:name>
<def>
</def>
</spirit:addressBlock>
</abc>
And I want to print lines containing the pattern <spirit:name> inside a block of lines, the block beginning with the pattern <spirit:addressBlock> and ending with </spirit:addressBlock>. I defined a function in .bash_aliases like this.
function SearchPatInBlk {
awk "/$1/{inblk=1} inblk==1&&/$2/{inblk=0} inblk==1&&/$3/{print \$0}" $4
}
So the first and second arguments are the block start and end patterns, the third argument is the pattern a line must contain to be printed, and the fourth argument is the XML filename. Then I gave this command at the bash shell.
SearchPatInBlk <spirit:addressBlock> </spirit:addressBlock> <spirit:name> ../../ab21/ab21_cmn700_new10_clst/build/ab21_cmn700/logical/cmn700/ipxact/cmn700_ab21.xml
Of course this gives me an error.
bash: syntax error near unexpected token `<'
So I tried putting escape characters (\) before <, > and /, but it doesn't work. How should I do it?
Upvotes: 0
Views: 98
Reputation: 36033
Don't use text processors like sed or awk on structured data. Use a (command-line) XML processor instead. Here are some options:
(Note that the sample given by itself isn't valid XML as it doesn't declare the spirit namespace, which leaves the elements spirit:addressBlock and spirit:name without their binding and eventually trips up some of these processors, so you might want to add something along the lines of <abc … xmlns:spirit="…" … > to the sample. But if your actual document does declare them properly, you'll be fine using any of these.)
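As a rough illustration of that note, the sample could be written out with the prefix declared on the root element. The URI urn:example:spirit below is only a placeholder assumed for this sketch; use whatever URI your real document declares.

```shell
# Write the sample from the question to input.xml, with the spirit prefix declared.
# NOTE: "urn:example:spirit" is a made-up placeholder, not the real namespace URI.
cat > input.xml <<'EOF'
<abc xmlns:spirit="urn:example:spirit">
<spirit:addressBlock>
<spirit:name>cmn700_registers</spirit:name>
<def>
</def>
</spirit:addressBlock>
</abc>
EOF
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding anything inside the XML.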
Using xmlstarlet:
The select (or sel) command makes xmlstarlet perform a query on the input, and the --template (or -t) flag starts a new template, with the --var option importing a quoted (hence the @Q) value into a variable, and the --value-of (or -v) option evaluating an XPath expression used for extraction and printing:
SearchPatInBlk() {
xmlstarlet sel -t --var fst="${1@Q}" --var snd="${2@Q}" \
-v '//*[name() = $fst]//*[name() = $snd]/text()' "$3"
}
SearchPatInBlk spirit:addressBlock spirit:name input.xml
To allow for omitting the namespaces in the function call, compare the elements' local-name() instead of their name() (but properly declared namespaces in the document are still required):
SearchPatInBlk() {
xmlstarlet sel -t --var fst="${1@Q}" --var snd="${2@Q}" \
-v '//*[local-name() = $fst]//*[local-name() = $snd]/text()' "$3"
}
SearchPatInBlk addressBlock name input.xml
Using libxml/xmllint:
With the --xpath option, this also uses an XPath expression to query and extract, but has no means to import external values, so the function's arguments are injected directly into the expression (note the double quotes around it):
SearchPatInBlk() {
xmllint --xpath "//*[name() = ${1@Q}]//*[name() = ${2@Q}]/text()" "$3"
}
SearchPatInBlk spirit:addressBlock spirit:name input.xml
The same change from name() to local-name(), in order to allow for omitting the namespaces in the function call, can be applied here as well (with the namespaces still required to be properly declared in the document):
SearchPatInBlk() {
xmllint --xpath \
"//*[local-name() = ${1@Q}]//*[local-name() = ${2@Q}]/text()" "$3"
}
SearchPatInBlk addressBlock name input.xml
Using kislyuk/yq:
Here the element names are given with their namespace prefix (e.g. spirit:name). The --arg option imports values into variables, and the --raw-output (or -r) flag decodes the result value into raw text (as otherwise it would be JSON-encoded, because under the hood it uses the JSON processor jq):
SearchPatInBlk() {
xq --arg fst "$1" --arg snd "$2" -r \
'..[$fst]? | ..[$snd]? | arrays[] // values' "$3"
}
SearchPatInBlk spirit:addressBlock spirit:name input.xml
Using mikefarah/yq:
The input is parsed as XML (which can also be requested explicitly using the --input-format xml (or -px) option), while the output is deliberately encoded as YAML using the --output-format yaml (or -oy) option to unquote the results:
SearchPatInBlk() {
fst="$1" snd="$2" yq -oy \
'.. | .[strenv(fst)]? | .. | [] + .[strenv(snd)]? | .[]' "$3"
}
SearchPatInBlk addressBlock name input.xml
Upvotes: 2
Reputation: 2809
echo '<abc>
<spirit:addressBlock>
<spirit:name>cmn700_registers</spirit:name>
<def>
</def>
</spirit:addressBlock>
</abc>' |
mawk 'gsub(/^[^<]*|[^>]+$/,_, $!(NF *= (NF % 2) < (2 < NF)))' \
ORS= FS='[<][/]' OFS='</' RS='spirit:addressBlock[>]'
1 <spirit:name>cmn700_registers</spirit:name>
2 <def>
3 </def>
Upvotes: 0
Reputation: 28910
Using a true XML parser would be better than a general purpose text processor like awk. But if you absolutely need awk there are several things to fix.
Pass the patterns to awk as awk variables, not as parts of the awk script.
Use the regex,regex awk range pattern.
Optionally you could also use more accurate regex and, if your awk is GNU awk, mark the patterns as regex constants (@/.../):
function SearchPatInBlk {
awk -v v1="$1" -v v2="$2" -v v3="$3" 'v1,v2 {if($0 ~ v3) print}' "$4"
}
SearchPatInBlk '@/^[[:space:]]*[<]spirit:addressBlock[>][[:space:]]*$/' \
'@/^[[:space:]]*[<][/]spirit:addressBlock[>][[:space:]]*$/' \
'@/[<]spirit:name[>]/' file
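For an awk without typed regex constants, a portable variant might look like the following sketch. The name search_pat_in_blk and the variable names are made up for illustration; the patterns are passed as plain strings and matched explicitly with ~, and the arguments are single-quoted so that bash does not read < and > as redirections (the cause of the syntax error in the question).

```shell
# Hypothetical helper: print lines matching $3 between lines matching $1 and $2.
# Uses an explicit flag instead of a range pattern, so any POSIX awk should do.
search_pat_in_blk() {
    awk -v start="$1" -v stop="$2" -v pat="$3" '
        $0 ~ start       { inblk = 1 }   # block start seen
        inblk && $0 ~ pat { print }      # wanted line inside the block
        $0 ~ stop        { inblk = 0 }   # block end seen
    ' "$4"
}

# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<abc>
<spirit:addressBlock>
<spirit:name>cmn700_registers</spirit:name>
<def>
</def>
</spirit:addressBlock>
</abc>
EOF

# Single quotes stop the shell from interpreting < and >.
search_pat_in_blk '<spirit:addressBlock>' '</spirit:addressBlock>' '<spirit:name>' "$sample"
rm -f "$sample"
```

On the sample this prints the single <spirit:name> line. The patterns are still regular expressions, so characters such as . or [ in an argument keep their regex meaning.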
Upvotes: 0