Ian Panzica

Reputation: 354

Match some regular expressions in specific lines and save those lines to an array in bash

I need to scan a file for specific tags and store whatever is between each pair of tags in a bash array. The input has this basic form:

<description> "long, multiline text descriptions" </description>

The text between the tags is what should be stored in the array.

Upvotes: 0

Views: 99

Answers (4)

John B

Reputation: 3646

This Bash solution should do the job. It collects the text that shares a line with the tags and joins continuation lines with a space.

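arr=()
match=""
in_desc=0   # 1 while inside a <description> that has not been closed yet
while read -r line; do
    if [[ $line =~ "<description>"(.*)"</description>" ]]; then
        # opening and closing tag on the same line
        arr+=("${BASH_REMATCH[1]}")
    elif [[ $line =~ "<description>"(.*) ]]; then
        # opening tag only (possibly alone on its line): start collecting
        in_desc=1
        match=${BASH_REMATCH[1]}
    elif (( in_desc )) && [[ $line =~ (.*)"</description>" ]]; then
        # closing tag: finish the entry and store it
        match+="${match:+ }${BASH_REMATCH[1]}"
        arr+=("$match")
        match=""
        in_desc=0
    elif (( in_desc )); then
        # continuation line between the tags
        match+="${match:+ }$line"
    fi
done < file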

Upvotes: 1

anubhava

Reputation: 785376

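Read the file line by line and test each line against a regex. Note that this only captures descriptions whose opening and closing tags are on the same line: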

arr=()
while read -r s; do
    [[ "$s" =~ "<description>"(.*)"</description>" ]] && arr+=("${BASH_REMATCH[1]}")
done < file
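
To see how the loop behaves, here is a minimal sketch run against a small made-up sample (the temporary file and its contents are assumptions for the demo, not part of the question). The quoted parts of the pattern match literally, and the text captured by the parenthesised group ends up in BASH_REMATCH[1]. The entry that spans several lines is skipped, because no single line contains both tags.

```shell
#!/usr/bin/env bash
# Hypothetical sample input; file name and contents are made up for this demo.
sample=$(mktemp)
cat >"$sample" <<'EOF'
<description>one-line entry</description>
<description>
spans two lines
</description>
EOF

arr=()
while IFS= read -r s; do
    # BASH_REMATCH[0] holds the whole match, BASH_REMATCH[1] the first (...) group.
    if [[ $s =~ "<description>"(.*)"</description>" ]]; then
        arr+=("${BASH_REMATCH[1]}")
    fi
done <"$sample"
rm -f "$sample"

printf 'entries captured: %d\n' "${#arr[@]}"
printf 'first entry: %s\n' "${arr[0]}"
```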

Upvotes: 1

aks

Reputation: 2346

First a little test script:

cat <<'EOF' >find-tag.sh
#!/bin/bash
# find-tag.sh TARGET XMLFILE
# Print, one per line, each <description> entry whose text matches TARGET.
target="$1"
awk "/<description>/   { keep=1; s=\"\" ; next}
     /<\/description>/ { keep=0; if (s ~ /$target/) { print s } ; next}
     {if (keep) { s = s \$0 }}
    " "$2"
EOF
chmod +x find-tag.sh

Next a little XML data file, with multi-line entries:

cat test.xml
<blah1>
  xxx
</blah1>
<description>
  blah1 blah1 blah1
  blah12 blah12 blah12
</description>
<description>
  blah2 blah2 blah2
  blah2a blah2a blah2a
</description>
<description>
  blah3 blah3 blah3
  blah3b blah3b blah3b
</description>
<description>
  blah4 blah4 blah4
  blah4c blah4c blah4c
</description>

Finally, the sample run. find-tag.sh prints one entry per line; those lines are read into the array data, which is then displayed one item per line.

data=()
# extract tagged entries to a file
./find-tag.sh blah test.xml >/tmp/xml

# read the XML extracts into the 'data' array
readarray -t data </tmp/xml

If your bash doesn't support readarray (it was added in bash 4.0), use this instead:

while IFS= read -r line ; do data+=( "$line" ) ; done </tmp/xml

Display the data (show the array items):

for ((i=0; i<${#data[@]}; i++)) ; do printf "line %d: %s\n" $i "${data[$i]}" ; done
line 0:   blah1 blah1 blah1  blah12 blah12 blah12
line 1:   blah2 blah2 blah2  blah2a blah2a blah2a
line 2:   blah3 blah3 blah3  blah3b blah3b blah3b
line 3:   blah4 blah4 blah4  blah4c blah4c blah4c
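
If you would rather not go through a temporary file for the extracts, readarray can read from a process substitution directly. A minimal sketch, assuming bash 4+ and using a made-up sample file and a hypothetical extract function that applies the same awk idea without the TARGET filter:

```shell
#!/usr/bin/env bash
# Hypothetical sample data written to a temporary file for the demo.
xml=$(mktemp)
cat >"$xml" <<'EOF'
<description>
  first a
  first b
</description>
<description>
  second a
</description>
EOF

# Join the lines between the tags and print one entry per output line.
# (extract is a name chosen for this sketch.)
extract() {
    awk '/<description>/   { keep = 1; s = ""; next }
         /<\/description>/ { keep = 0; print s; next }
         keep              { s = s $0 }' "$1"
}

# readarray reads straight from the command's output; -t drops the newlines.
readarray -t data < <(extract "$xml")
rm -f "$xml"

for i in "${!data[@]}"; do
    printf 'line %d: %s\n' "$i" "${data[$i]}"
done
```

Because the redirection comes from a process substitution rather than a pipe, the loop-free readarray call runs in the current shell and the data array stays available afterwards.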

BTW, you can find some useful array management utilities for bash at https://github.com/aks/bash-lib#list_utils

Upvotes: 0

Charles Duffy

Reputation: 295579

This implementation requires xmlstarlet (http://xmlstar.sourceforge.net/) and assumes that no EOT (end-of-transmission) characters exist within the contents. It has the advantages associated with being based on a real XML parser -- entities are processed, comments ignored, CDATA interpreted literally, etc.

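descriptions=()
# -t starts the template; -m selects each <description> element,
# -v . prints its text, and -o appends an EOT byte as the entry terminator.
# With no file argument xmlstarlet reads the XML from stdin;
# append a file name to read from a file instead.
while IFS='' read -r -d $'\x04'; do
  descriptions+=( "$REPLY" )
done < <(xmlstarlet sel -t -m //description -v . -o $'\x04')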
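
The read loop itself can be tried without xmlstarlet installed. In this minimal sketch a hypothetical emit_entries function stands in for the xmlstarlet output (the entry texts are made up); it shows that with -d $'\x04' an entry containing newlines still lands in a single array element.

```shell
#!/usr/bin/env bash
# Stand-in for the parser output: two entries, each terminated by EOT (0x04).
# emit_entries and its contents are assumptions for this demo.
emit_entries() {
    printf 'first line\nsecond line\x04'
    printf 'single line\x04'
}

descriptions=()
# IFS= keeps surrounding whitespace; -d $'\x04' makes read stop at EOT
# instead of at a newline, so embedded newlines stay inside one entry.
while IFS= read -r -d $'\x04'; do
    descriptions+=( "$REPLY" )
done < <(emit_entries)

printf 'count: %d\n' "${#descriptions[@]}"
printf '[%s]\n' "${descriptions[@]}"
```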

Upvotes: 1
