Reputation: 354
I need to scan a file for specific tags and store whatever is between each pair of tags in a bash array. The basic syntax is as follows:
<description> "long, multiline text descriptions" </description>
The text between the tags should be stored in an array, one element per description.
Upvotes: 0
Views: 99
Reputation: 3646
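This Bash solution should do the job. It handles descriptions on a single line as well as descriptions that span several lines, whose lines are joined with spaces.
arr=()
match=""
in_desc=0   # 1 while inside an unclosed <description> block
while read -r line; do
  if [[ $line =~ "<description>"(.*)"</description>" ]]; then
    # opening and closing tag on the same line
    arr+=("${BASH_REMATCH[1]}")
  elif [[ $line =~ "<description>"(.*) ]]; then
    # opening tag only: start collecting (the rest of the line may be empty)
    in_desc=1
    match=${BASH_REMATCH[1]}
  elif (( in_desc )) && [[ $line =~ (.*)"</description>" ]]; then
    # closing tag: append any text before it and store the entry
    tail=${BASH_REMATCH[1]}
    [[ $tail ]] && match+="${match:+ }$tail"
    arr+=("$match")
    in_desc=0
    match=""
  elif (( in_desc )); then
    # line in the middle of a description
    match+="${match:+ }$line"
  fi
done < file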
Upvotes: 1
Reputation: 785376
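Read the file line by line and check each line against a regex. This only picks up descriptions whose opening and closing tags are on the same line: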
arr=()
while read -r s; do
[[ "$s" =~ "<description>"(.*)"</description>" ]] && arr+=("${BASH_REMATCH[1]}")
done < file
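For descriptions that span several lines, one possible variation is to read the whole file into a single string and peel off one tag pair at a time. This is only a rough sketch: it uses parameter expansion instead of =~ (bash regexes have no non-greedy match), it keeps the embedded newlines in each entry, and the sample file plus the names open, close and rest are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: collect multi-line <description> blocks with parameter expansion.
# The sample file and all variable names here are illustrative assumptions.

file=$(mktemp)
cat >"$file" <<'EOF'
<description>one line</description>
<other>ignored</other>
<description>two
lines</description>
EOF

open='<description>'
close='</description>'
arr=()

rest=$(<"$file")                   # whole file as one string
while [[ $rest == *"$open"*"$close"* ]]; do
  rest=${rest#*"$open"}            # drop everything through the first opening tag
  arr+=("${rest%%"$close"*}")      # keep the text before the next closing tag
  rest=${rest#*"$close"}           # continue after that closing tag
done
rm -f "$file"

printf 'entry: %s\n' "${arr[@]}"
```

Since the file is held in memory as one string, this is probably best kept for small inputs.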
Upvotes: 1
Reputation: 2346
First a little test script:
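cat <<'EOF' >find-tag.sh
#!/bin/bash
# find-tag.sh TARGET XMLFILE
# Print each <description> block whose text matches the regex TARGET,
# one block per output line, with the block's lines joined by spaces.
awk -v target="$1" '
/<description>/   { keep = 1; s = ""; next }
/<\/description>/ { keep = 0; if (s ~ target) print s; next }
keep              { s = (s == "" ? $0 : s " " $0) }
' "$2"
EOF
chmod +x find-tag.sh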
Next a little XML data file, with multi-line entries:
cat test.xml
<blah1>
xxx
</blah1>
<description>
blah1 blah1 blah1
blah12 blah12 blah12
</description>
<description>
blah2 blah2 blah2
blah2a blah2a blah2a
</description>
<description>
blah3 blah3 blah3
blah3b blah3b blah3b
</description>
<description>
blah4 blah4 blah4
blah4c blah4c blah4c
</description>
Finally, the sample run. Each line of output from find-tag.sh is an entry, which is read into the array data. The data array is then displayed, one item per line.
data=()
# extract tagged entries to a file
./find-tag.sh blah test.xml >/tmp/xml
# read the XML extracts into the 'data' array
readarray -t data </tmp/xml
If your bash doesn't support readarray (it was added in bash 4.0), use this instead:
while IFS= read -r line ; do data+=( "$line" ) ; done </tmp/xml
Display the data (show the array items):
for ((i=0; i<${#data[@]}; i++)) ; do printf "line %d: %s\n" $i "${data[$i]}" ; done
line 0: blah1 blah1 blah1 blah12 blah12 blah12
line 1: blah2 blah2 blah2 blah2a blah2a blah2a
line 2: blah3 blah3 blah3 blah3b blah3b blah3b
line 3: blah4 blah4 blah4 blah4c blah4c blah4c
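If you'd rather skip the temporary file, the extracts can probably be read straight into the array with process substitution. A minimal sketch, assuming bash 4+ for readarray; extract_descriptions is a hypothetical stand-in for find-tag.sh so the example runs on its own, and the sample data is made up:

```bash
#!/usr/bin/env bash
# Sketch: same extraction, but read straight into the array (no /tmp/xml).
# extract_descriptions is a hypothetical stand-in for find-tag.sh.

extract_descriptions() {   # usage: extract_descriptions TARGET XMLFILE
  awk -v target="$1" '
    /<description>/   { keep = 1; s = ""; next }
    /<\/description>/ { keep = 0; if (s ~ target) print s; next }
    keep              { s = (s == "" ? $0 : s " " $0) }
  ' "$2"
}

# Illustrative sample data: only the first block matches "blah".
xml=$(mktemp)
cat >"$xml" <<'EOF'
<description>
blah1 blah1
blah12 blah12
</description>
<description>
other text
</description>
EOF

data=()
readarray -t data < <(extract_descriptions blah "$xml")
rm -f "$xml"

for i in "${!data[@]}"; do
  printf 'line %d: %s\n' "$i" "${data[i]}"
done
```

With the real script, the readarray line would become readarray -t data < <(./find-tag.sh blah test.xml).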
BTW, you can find some useful array management utilities for bash at https://github.com/aks/bash-lib#list_utils
Upvotes: 0
Reputation: 295579
This implementation requires xmlstarlet (http://xmlstar.sourceforge.net/) and assumes that no EOT (end-of-transmission) characters exist within the contents. It has the advantages associated with being based on a real XML parser -- entities are processed, comments ignored, CDATA interpreted literally, etc.
descriptions=()
# "file" is the XML input, as in the other answers
while IFS='' read -r -d $'\x04'; do
  descriptions+=( "$REPLY" )
done < <(xmlstarlet sel -t -m '//description' -v . -o $'\x04' file)
Upvotes: 1