Vladislavs Dovgalecs

Reputation: 1631

Apply regex on matched substring

I have a few thousand text lines like this:

go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>

My desired output replaces everything from the first tagged word (rome) to the last one (italy) with a single tagged span:

go to <ADDRESS>rome italy</ADDRESS>

I can match the portion of the text line which is tagged with:

<.*>

This greedily selects all text from the first < to the last >. I would then like the inner tags removed and <ADDRESS> and </ADDRESS> placed around the matched portion.

The possible tags are: <STREETNUM>, <STREET>, <CITY>, <STATE>, <ZIP> and <COUNTRY>. Any subset of these tags can appear and in any order. The tags are never nested.

I have searched SO and googled to no avail. Perhaps I can use a named capturing group and then apply a search/replace regex to it, but I don't know how. Any help would be appreciated.

Upvotes: 2

Views: 63

Answers (1)

dinox0r

Reputation: 16039

This sed line handles the CITY/COUNTRY example:

sed 's/<CITY>\(.*\)<\/CITY>.*<COUNTRY>\(.*\)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g'

For example:

sed 's/<CITY>\(.*\)<\/CITY>.*<COUNTRY>\(.*\)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g'  <<< "go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>"

It prints:

go to <ADDRESS>rome italy</ADDRESS> 

It captures the text inside the CITY tag and the text inside the COUNTRY tag, then replaces the whole match with the two captured groups wrapped in an ADDRESS tag.


With GNU sed (the default on Linux) or BSD sed, the -E flag enables extended regular expressions, so the parentheses no longer need escaping:

sed -E 's/<CITY>(.*)<\/CITY>.*<COUNTRY>(.*)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g'
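One caveat with the pattern above: .* is greedy, so on a line with more than one CITY/COUNTRY pair the first group can run past the nearest closing tag. Below is a minimal sketch of a variant that uses [^<]* instead. It assumes the tagged values never contain a < character, which fits the question's statement that tags are never nested. The function name merge_city_country and the two-pair sample line are made up for illustration, and the trailing space in the replacement is dropped.

```shell
# Hypothetical helper: [^<]* cannot cross a tag boundary,
# so each CITY/COUNTRY pair on a line is matched separately.
merge_city_country() {
  sed -E 's/<CITY>([^<]*)<\/CITY>[^<]*<COUNTRY>([^<]*)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS>/g'
}

# Made-up input with two pairs on one line.
printf '%s\n' 'go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY> then <CITY>paris</CITY> <COUNTRY>france</COUNTRY>' |
  merge_city_country
# prints: go to <ADDRESS>rome italy</ADDRESS> then <ADDRESS>paris france</ADDRESS>
```

This still only covers a CITY tag followed by a COUNTRY tag; it does not handle the other tags or other orders.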

UPDATE:

To achieve the expected result for any of the tags, you can chain several commands in this order:

  1. Remove the go to text: sed 's/go to //g'
  2. Remove all the tag characters: tr -d '</>'
  3. Once all tag chars are removed, you can safely delete the words STREETNUM, STREET, CITY, STATE, ZIP and COUNTRY from the input:

    sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g'

  4. Take the output of the previous commands and wrap it in the <ADDRESS></ADDRESS> tags:

    xargs -i echo "go to <ADDRESS>{}</ADDRESS>"

The final command is the following, where $LINE contains the line to process:

sed 's/go to //g' <<< "$LINE" | tr -d '</>' | sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g' | xargs -i echo "go to <ADDRESS>{}</ADDRESS>"

An example:

Running:

sed 's/go to //g' <<< "go to <STATE>Bolivar</STATE> <COUNTRY>Venezuela</COUNTRY> <STREETNUM>5</STREETNUM> " | tr -d '</>' | sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g' | xargs -i echo "go to <ADDRESS>{}</ADDRESS>"

Will print:

go to <ADDRESS>Bolivar Venezuela 5 </ADDRESS>
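The pipeline above hard-codes the "go to" prefix and would also delete words such as CITY or STREET if they appeared in the data itself. As a hedged alternative, here is a sketch that does the same job in a single sed call: turn the first opening tag into <ADDRESS>, turn the last closing tag into </ADDRESS>, then delete the tags left in between. The function name merge_address and the tags variable are illustrative, and the sketch assumes a sed that supports -E (GNU or BSD) and one address span per line.

```shell
# The six tags listed in the question; STREETNUM comes before STREET for readability.
tags='STREETNUM|STREET|CITY|STATE|ZIP|COUNTRY'

# Hypothetical helper: reads lines on stdin, writes rewritten lines on stdout.
merge_address() {
  # 1st expression: the first opening tag on the line becomes <ADDRESS>.
  # 2nd expression: greedy (.*) reaches the last closing tag, which becomes </ADDRESS>.
  # 3rd expression: every remaining opening or closing tag is deleted.
  sed -E \
    -e "s/<($tags)>/<ADDRESS>/" \
    -e "s/(.*)<\/($tags)>/\1<\/ADDRESS>/" \
    -e "s/<\/?($tags)>//g"
}

printf '%s\n' 'go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>' | merge_address
# prints: go to <ADDRESS>rome italy</ADDRESS>
```

Lines without any of the tags pass through unchanged, so a whole file can be processed with merge_address < input.txt.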

Upvotes: 2
