Technext

Reputation: 8107

Using pattern in Shell Parameter Expansion

I am reading a page and trying to extract some data from it. I want to do this in bash, and after going through a few links I learned that 'Shell Parameter Expansion' might help. However, I am having difficulty using it in my script. I know that sed might be easier, but for my own knowledge I want to know how to achieve this in bash.

shopt -s extglob

str='My work</u><br /><span style="color: rgb(34,34,34);"></span><span>abc-X7-27ABC | </span><span style="color: rgb(34,34,34);">build'
echo "${str//<.*>/|}"

I want my output to be like this: My work|abc-X7-27ABC |build

I checked whether it accepts only a literal word rather than a pattern, and it does seem to work with literal words.

For instance,
echo "${str//span style/|}" works but
echo "${str//span.*style/|}" doesn't

On the other hand, I saw in one of the links that it does accept patterns. I am confused about why it's not working with the pattern I am using above.

How to make sed do non-greedy match? (User konsolebox's solution)

Upvotes: 3

Views: 201

Answers (2)

chepner

Reputation: 530843

This is not an answer, so much as a demonstration of why pattern-matching is not recommended for this kind of HTML editing. I attempted the following.

shopt -s extglob
set +H    # Turn off history expansion, if necessary, to allow the !(...) pattern
echo "${str//+(<+(!(>))>)/|}"    # quoted so the result is not word-split or globbed

First: it didn't work, even for a simpler string like str='My work</u><br />bob<foo>build'. Second, for the string in the original question, it appeared to lock up the shell; I suspect such a complex pattern triggers exponential backtracking.

Here's how it's intended to work:

  1. !(>) is anything other than a single >
  2. +(!(>)) is one or more non-> characters.
  3. <+(!(>))> is one or more non-> characters enclosed in < and >
  4. +(<+(!(>))>) is one or more groups of <...>-enclosed non->s.

My theory is that since !(>) can match a multi-character string as well as a single character, there is a ton of backtracking required.
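To make that point concrete, here is a small sketch showing that !(>) is not a "single non-> character" class. The helper name matches_not_gt and the variable pat are made up for this illustration, and it assumes bash 4.1 or later, where the right-hand side of == inside [[ ]] is matched as an extended glob.

```shell
shopt -s extglob

# Hypothetical helper for illustration: succeeds if the whole string matches !(>)
pat='!(>)'
matches_not_gt() {
  [[ $1 == $pat ]]    # unquoted $pat is treated as a pattern, not a literal
}

for s in 'a' 'abc' 'a>b' '>'; do
  if matches_not_gt "$s"; then
    echo "'$s' matches !(>)"
  else
    echo "'$s' does not match !(>)"
  fi
done
```

Only the string consisting of exactly > is rejected. Multi-character strings, even ones containing >, are accepted, so +(!(>)) gives the matcher many ways to split the same text.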

Upvotes: 1

anubhava

Reputation: 784878

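One mistake you're making is mixing shell globbing with regex. In a shell glob, a dot is a literal dot character, not "any single character" as in regex, and * on its own already means zero or more of any character, so the regex idiom .* has no special meaning.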

If you try this code instead:

echo "${str//<*>/|}"

then it will print:

My work|build
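That output collapses everything between the first < and the last > because * takes the longest match. As a possible workaround (a sketch, not necessarily the best approach), an extglob bracket expression can keep each match inside a single tag. The variable name out is introduced here for illustration only.

```shell
shopt -s extglob

str='My work</u><br /><span style="color: rgb(34,34,34);"></span><span>abc-X7-27ABC | </span><span style="color: rgb(34,34,34);">build'

# <*([^>])> matches one tag: "<", zero or more non-">" characters, then ">".
out=${str//<*([^>])>/|}

# Adjacent tags leave runs of "|"; collapse them to a single "|".
while [[ $out == *'||'* ]]; do
  out=${out//'||'/|}
done

echo "$out"    # prints: My work|abc-X7-27ABC | |build
```

This is close to, but not exactly, the output asked for in the question: the source text itself contains " | " before the closing tag, so a literal pipe from the data sits next to the one inserted for the tags.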

Upvotes: 3
