Maarten Kuilman

Reputation: 501

Search until next character with sed and regex

I have an image tag with a URL like:

<img alt="" src="http://www.example-site.com/folder_with_underscore/folder-with-dash/3635/0/235/NumBerS_and_Uc/image.png" />

I'm using sed "s///g".

I'm trying to replace the src value, but that value is completely different most of the time.

Is there a way to write something like sed "s/src=\" (until first " ) / new url /g"?

Extra info:

I'm using Cygwin on Windows and PATH=C:\cygwin\bin in my .bat file

Upvotes: 0

Views: 2625

Answers (2)

William Pursell

Reputation: 212664

Shawn's solution is mostly correct, but it does not handle the case in which a newline appears inside the src URL. sed is really not very good at dealing with such cases, but you can hack a solution:

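sed '/src="/{
# Complete src="..." on this line: replace it and skip the rest of the script.
/src="[^"]*"/{
s//src="NEWURL"/
b
}
# Opening quote only: replace to end of line and print that line.
s/src=".*$/src="NEWURL"/
p
# Discard following lines until one contains the closing quote.
:a
s/.*//
N
/"/!ba
s/[^"]*"//
}
' input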

Note that many of the newlines above are superfluous in some versions of sed but necessary in others, in particular the newlines after :a and after the branch command, since some versions of sed terminate a label only at a newline. (I believe that versions of sed which allow a label to terminate with a semicolon are not strictly compliant with the standard, but it is common practice.) This script does the simple replacement where appropriate, but if no closing quote is found after src=", it enters a loop that deletes lines until a terminating " is seen. This is an ugly solution, and I recommend against using sed for parsing XML.
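As an alternative to the line-deleting loop, one could read the whole input into the pattern space first and then apply the plain substitution, since [^"] also matches an embedded newline. The following is only a sketch under assumptions: the function name replace_src_multiline and the sample input are made up for illustration, NEWURL is a placeholder, and the input is assumed to be small enough to hold in memory.

```shell
# Hypothetical helper: slurp all of stdin, then substitute once over the whole buffer.
replace_src_multiline() {
  # :a / $!{ N; ba }  appends lines until the last one has been read;
  # the final s command then sees the complete input, newlines included.
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
      -e 's/src="[^"]*"/src="NEWURL"/g'
}

# Sample input (made up) in which the src value is split across two lines.
printf '<img alt=""\nsrc="http://example.com/a\nb.png" />\n<p>text</p>\n' | replace_src_multiline
```

Each part of the script is passed with its own -e so that the label and the branch command are terminated by a newline, which sidesteps the portability issue described above.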

Upvotes: 1

Shawn Chin

Reputation: 86974

[^"] will match any character apart from ", so you can use:

 sed 's/src="[^"]*"/src="NEWURL"/g'

Example:

[me@home]$ echo '<img alt="" src="http://www.example-site.com/folder_with_underscore/folder-with-dash/3635/0/235/NumBerS_and_Uc/image.png" />' | sed 's/src="[^"]*"/src="http:\/\/stackoverflow.com"/g'
<img alt="" src="http://stackoverflow.com" />
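The escaped slashes in the replacement URL can be avoided by choosing a different delimiter for the s command. A minimal sketch, assuming the new URL contains no |, & or backslash (all of which are special here); the variable NEW_URL and the function replace_src are names invented for this illustration:

```shell
# NEW_URL is a placeholder value chosen for this illustration.
NEW_URL='http://stackoverflow.com'

# With | as the s-command delimiter, the slashes in the URL need no escaping.
# Double quotes let the shell expand $NEW_URL; the inner quotes are escaped for the shell.
replace_src() {
  sed "s|src=\"[^\"]*\"|src=\"$NEW_URL\"|g"
}

echo '<img alt="" src="http://www.example-site.com/folder/image.png" />' | replace_src
```

If the replacement may contain arbitrary characters, it would need escaping before being spliced into the sed expression.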

Note that [^"]* matches up to the first occurrence of ", which is probably what you want. If you really want to match up to the last occurrence of ", you could simply do:

 sed 's/src=".*"/src="NEWURL"/g'

The regex is greedy, so it takes up as many characters as possible and thus matches up to the last occurrence of ". While this also works for the example above, it will not behave as expected if the line contains any other " after the src value.
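To make the difference concrete, here is a small comparison on a line where another attribute follows src. The sample line is made up for illustration and NEWURL is a placeholder:

```shell
# Made-up sample line with an attribute after src.
line='<img src="a.png" alt="logo" />'

# Negated bracket expression: stops at the first closing quote.
first=$(printf '%s\n' "$line" | sed 's/src="[^"]*"/src="NEWURL"/g')

# Greedy .* : runs to the last quote on the line and swallows alt="logo".
last=$(printf '%s\n' "$line" | sed 's/src=".*"/src="NEWURL"/g')

printf '%s\n' "$first"   # <img src="NEWURL" alt="logo" />
printf '%s\n' "$last"    # <img src="NEWURL" />
```

The second command silently drops the alt attribute, which is why the [^"]* form is usually the safer choice.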

Upvotes: 5
