Reputation: 11473
I'm trying to come up with a sed regular expression that ignores anything inside quoted HTML attribute values and matches ONLY the text content of the element.
<p alt="100">100</p> #need to match only second 100
<img src="100.jpg">100</img> #need to match only second 100
<span alt="tel:100">100</span> #need to match only second 100
These are my attempts:
grep -E '(!?\")100(!?\")' html # this matches string as well as quotes
grep -E '[^\"]100[^\"]' html # this doesn't work either
Ok. I was trying to simplify the question but maybe that's wrong.
With the command sed -r 's/?????/__replaced__/g' file
I would need to see:
<p alt="100">__replaced__</p>
<img src="100.jpg">__replaced__</img>
<span alt="tel:100">__replaced__</span>
Upvotes: 0
Views: 202
Reputation: 203368
Your question's gotten kinda muddy through its evolution, but is this what you're asking for?
$ sed -r 's/>[^<]+</>__replaced__</' file
<p alt="100">__replaced__</p> #need to match only second 100
<img src="100.jpg">__replaced__</img> #need to match only second 100
<span alt="tel:100">__replaced__</span> #need to match only second 100
If not, please clean up your question so it shows just the latest sample input, the expected output, and an explanation.
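For anyone who wants to check that substitution outside the shell, here is a minimal Python sketch of the same idea. The function name replace_element_text is hypothetical. Like the sed command without the g flag, it rewrites only the first run of text between a ">" and the next "<" on each line, whatever that text is, and it assumes no attribute value contains a literal ">".

```python
import re

# Same pattern as the sed command: one or more non-'<' characters
# sitting between a '>' and the next '<'.
TEXT_BETWEEN_TAGS = re.compile(r">[^<]+<")


def replace_element_text(line, replacement="__replaced__"):
    """Replace the first text run between tags; count=1 mirrors sed without /g."""
    # A function is used as the replacement so backslashes in it stay literal.
    return TEXT_BETWEEN_TAGS.sub(lambda m: ">" + replacement + "<", line, count=1)


if __name__ == "__main__":
    for sample in ('<p alt="100">100</p>',
                   '<img src="100.jpg">100</img>',
                   '<span alt="tel:100">100</span>'):
        print(replace_element_text(sample))
```

If only the literal 100 should be replaced, the pattern could be narrowed to r">100<", at the cost of missing text that merely contains 100.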
Upvotes: 0
Reputation: 44023
I don't think handling HTML with sed (or grep) is a good idea. Consider using Python, which has an HTML push parser in its standard library. This makes separating tags from data easy. Since you only want to handle the data between tags, it could look something like this:
#!/usr/bin/env python3
# Python 3; in Python 2 the module was named HTMLParser.
from html.parser import HTMLParser
from sys import argv

class MyParser(HTMLParser):
    def handle_data(self, data):
        # data is the string between tags. You can do anything you like with it.
        # For a simple example:
        if data == "100":
            print(data)

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    MyParser().feed(f.read())
Update for the updated question: to edit HTML with this, you'll have to implement the handle_starttag and handle_endtag methods as well as handle_data, in a manner that reprints the parsed tags. For example:
#!/usr/bin/env python3
# Python 3; in Python 2 the module was named HTMLParser.
from html.parser import HTMLParser
from sys import stdout, argv
import re

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Reprint the opening tag and its attributes.
        stdout.write("<" + tag)
        for k, v in attrs:
            stdout.write(' {}="{}"'.format(k, v))
        stdout.write(">")

    def handle_endtag(self, tag):
        stdout.write("</{}>".format(tag))

    def handle_data(self, data):
        # Only the text between tags is rewritten.
        data = re.sub("100", "__replaced__", data)
        stdout.write(data)

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    MyParser().feed(f.read())
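A variant that returns the rewritten HTML as a string can be easier to test than one that writes to stdout. The following is only a sketch under a few assumptions: it targets Python 3, the names TextReplacer and replace_text are hypothetical, and it still drops comments and doctype declarations and rewrites self-closing tags such as <br/> as <br></br>, so it is not a faithful round-trip for arbitrary documents.

```python
from html import escape
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Re-emit parsed HTML, replacing a target string in text nodes only."""

    def __init__(self, target, replacement):
        # convert_charrefs=False routes entities such as &amp; to
        # handle_entityref/handle_charref so they can be re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.target = target
        self.replacement = replacement
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            # Valueless attributes (e.g. "disabled") arrive with value None.
            if value is None:
                parts.append(name)
            else:
                parts.append('{}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Attribute values never reach this method, so they are left alone.
        self.out.append(data.replace(self.target, self.replacement))

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))


def replace_text(html, target="100", replacement="__replaced__"):
    parser = TextReplacer(target, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(replace_text('<span alt="tel:100">100</span>'))
```

Collecting the output in a list and joining it at the end keeps the parser free of I/O, so the same class works on a file's contents or on a single line.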
Upvotes: 4
Reputation: 53478
A first warning: parsing HTML with regular expressions is generally not a good idea; the answer is to use an HTML parser. Most scripting languages (Perl, Python, etc.) have HTML parsers.
See here for an example as to why: RegEx match open tags except XHTML self-contained tags
If you really must though:
/(?!\>)([^<>]+)(?=\<)/
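Note that lookarounds like these are not available in sed or grep -E; they need a PCRE-style engine. As a rough illustration, here is a Python sketch of the same idea. It assumes the intent is "text directly preceded by > and directly followed by <", so it writes the leading condition as a lookbehind, (?<=>), which is my reading rather than the pattern exactly as posted. The function name element_text is hypothetical.

```python
import re

# Text that directly follows a '>' and is directly followed by a '<'.
# The lookarounds assert the delimiters without including them in the match.
ELEMENT_TEXT = re.compile(r"(?<=>)[^<>]+(?=<)")


def element_text(line):
    """Return every run of text found between a closing '>' and the next '<'."""
    return ELEMENT_TEXT.findall(line)


if __name__ == "__main__":
    print(element_text('<p alt="100">100</p>'))
```

Because the delimiters are only asserted, a substitution with this pattern replaces the text alone and leaves the surrounding tags in place.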
Upvotes: 2
Reputation: 174706
You may try the PCRE regex below.
grep -oP '"[^"]*100[^"]*"(*SKIP)(*F)|\b100\b' file
or
grep -oP '"[^"]*"(*SKIP)(*F)|\b100\b' file
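This matches the number 100 only where it is not inside double quotes: the first alternative consumes a double-quoted string and (*SKIP)(*F) then discards it, so only the second alternative, \b100\b, can produce a match.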
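Python's standard re module has no (*SKIP)(*F) verbs, but the same "consume the quoted part, then discard it" idea can be approximated with an alternation and a replacement function. This is a sketch of that substitute technique, not a port of the PCRE verbs; the name replace_unquoted is hypothetical.

```python
import re

# First alternative swallows a whole double-quoted string;
# the second is the real target.
QUOTED_OR_TARGET = re.compile(r'"[^"]*"|\b100\b')


def replace_unquoted(line, replacement="__replaced__"):
    """Replace 100 only where it is not inside double quotes."""
    def choose(match):
        text = match.group(0)
        # Quoted strings are put back unchanged; bare targets are replaced.
        return text if text.startswith('"') else replacement
    return QUOTED_OR_TARGET.sub(choose, line)


if __name__ == "__main__":
    print(replace_unquoted('<img src="100.jpg">100</img>'))
```

Since the quoted alternative is tried first at each position, a 100 inside quotes is always swallowed as part of the quoted string before the second alternative gets a chance to see it.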
Upvotes: 1