Reputation: 8107
I am reading a page and trying to extract some data from it. I would like to do this in bash, and after going through a few links I learned that 'Shell Parameter Expansion' might help; however, I am having difficulty using it in my script. I know that using sed might be easier, but for my own knowledge I want to know how I can achieve this in bash.
shopt -s extglob
str='My work</u><br /><span style="color: rgb(34,34,34);"></span><span>abc-X7-27ABC | </span><span style="color: rgb(34,34,34);">build'
echo "${str//<.*>/|}"
I want my output to be like this: My work|abc-X7-27ABC |build
I checked whether it accepts only literal words rather than patterns, and it does seem to work with literal words.
For instance,
echo "${str//span style/|}"
works but
echo "${str//span.*style/|}"
doesn't
On the other hand, I saw in one of the links that it does accept patterns, so I am confused about why it's not working with the pattern I am using above.
How to make sed do non-greedy match? (User konsolebox's solution)
Upvotes: 3
Views: 201
Reputation: 530843
This is not an answer, so much as a demonstration of why pattern-matching is not recommended for this kind of HTML editing. I attempted the following.
shopt -s extglob
set +H # Turn off history expansion, if necessary, to allow the !(...) pattern
echo ${str//+(<+(!(>))>)/|}
First: it didn't work, even for a simpler string like str='My work</u><br />bob<foo>build'. Second, for the string in the original question, it appeared to lock up the shell; I suspect such a complex pattern triggers exponential backtracking.
Here's how it's intended to work:
!(>) is anything other than a single >
+(!(>)) is one or more non-> characters.
<+(!(>))> is one or more non-> characters enclosed in < and >
+(<+(!(>))>) is one or more groups of <...>-enclosed non->s.
My theory is that since !(>) can match a multi-character string as well as a single character, there is a ton of backtracking required.
Upvotes: 1
Reputation: 784878
One mistake you're making is mixing shell globbing and regex. In a shell glob, a dot is taken literally as a dot character, not as a regex wildcard; to match 0 or more of any character, a glob uses * on its own rather than .*
If you try this code instead:
echo "${str//<*>/|}"
then it will print:
My work|build
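The result collapses to My work|build because * matches the longest possible string, so <*> runs from the first < to the last >. One way to stop each match at the nearest > is the extglob operator *(...) with a bracket expression. The following is only a sketch of that idea, assuming bash with extglob available; the variable name out is chosen for illustration.

```shell
shopt -s extglob   # enables the *(...) operator used below

str='My work</u><br /><span style="color: rgb(34,34,34);"></span><span>abc-X7-27ABC | </span><span style="color: rgb(34,34,34);">build'

# <*([^>])> matches one tag: <, then zero or more non-> characters, then >.
out="${str//<*([^>])>/|}"

# Adjacent tags leave runs of |; squeeze each run down to a single |.
while [[ $out == *'||'* ]]; do
  out="${out//||/|}"
done

echo "$out"
```

On the sample string this prints My work|abc-X7-27ABC | |build. The | that remains before build's separator comes from the literal " | " already present in the text, so it would need its own cleanup step if it is unwanted.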
Upvotes: 3