zzart

Reputation: 11473

sed don't match characters inside parenthesis

I'm trying to come up with a greedy sed expression that ignores anything inside quoted HTML attribute values and matches ONLY the text of the element.

<p alt="100">100</p> #need to match only second 100
<img src="100.jpg">100</img> #need to match only second 100
<span alt="tel:100">100</span> #need to match only second 100

These are my attempts:

grep -E '(!?\")100(!?\")' html # this matches string as well as quotes 
grep -E '[^\"]100[^\"]' html # this doesn't work either

Edit

OK. I was trying to simplify the question, but maybe that was a mistake.

With the command sed -r 's/?????/__replaced__/g' file I would need to see:

<p alt="100">__replaced__</p>
<img src="100.jpg">__replaced__</img> 
<span alt="tel:100">__replaced__</span> 

Upvotes: 0

Views: 202

Answers (4)

Ed Morton

Reputation: 203368

Your question has gotten kind of muddy through its evolution, but is this what you're asking for?

$ sed -r 's/>[^<]+</>__replaced__</' file
<p alt="100">__replaced__</p> #need to match only second 100
<img src="100.jpg">__replaced__</img> #need to match only second 100
<span alt="tel:100">__replaced__</span> #need to match only second 100

If not, please clean up your question so that it shows just the latest sample input, the expected output, and an explanation.
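For anyone who wants to try the same idea outside sed, here is a minimal Python sketch of that pattern. It makes the same assumptions as the sed command (one element per line, no `<` inside the text), and it replaces whatever text sits between `>` and `<`, not just 100. The name `replace_element_text` is made up for illustration.

```python
import re

# Same pattern as the sed command: '>', one or more non-'<' characters, then '<'.
TEXT_BETWEEN_TAGS = re.compile(r">[^<]+<")


def replace_element_text(line, replacement="__replaced__"):
    """Hypothetical helper: replace the first run of text between '>' and '<'."""
    # count=1 mirrors sed's default of replacing only the first match per line.
    # A function is used as the replacement so backslashes are taken literally.
    return TEXT_BETWEEN_TAGS.sub(lambda m: ">" + replacement + "<", line, count=1)


samples = [
    '<p alt="100">100</p>',
    '<img src="100.jpg">100</img>',
    '<span alt="tel:100">100</span>',
]

for line in samples:
    print(replace_element_text(line))
```

The attribute values are left alone only because they never sit directly between a `>` and a `<`; the pattern knows nothing about quotes.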

Upvotes: 0

Wintermute

Reputation: 44023

I don't think handling HTML with sed (or grep) is a good idea. Consider using Python, which has an HTML push parser in its standard library. This makes it easy to separate tags from data. Since you only want to handle the data between tags, it could look something like this:

#!/usr/bin/env python3

# In Python 3 the parser lives in html.parser (it was the HTMLParser module in Python 2).
from html.parser import HTMLParser
from sys import argv

class MyParser(HTMLParser):
    def handle_data(self, data):
        # data is the string between tags. You can do anything you like with it.
        # For a simple example:
        if data == "100":
            print(data)

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    parser = MyParser()
    parser.feed(f.read())
    parser.close()  # flush any text still buffered at the end of the input

Update for updated question: To edit HTML with this, you'll have to implement the handle_starttag and handle_endtag methods as well as handle_data in a manner that reprints the parsed tags. For example:

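#!/usr/bin/env python3

# In Python 3 the parser lives in html.parser (it was the HTMLParser module in Python 2).
from html.parser import HTMLParser
from sys import stdout, argv
import re

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Reprint the opening tag together with its attributes.
        stdout.write("<" + tag)
        for k, v in attrs:
            if v is None:
                # Attribute without a value, e.g. <input disabled>
                stdout.write(" " + k)
            else:
                stdout.write(' {}="{}"'.format(k, v))
        stdout.write(">")

    def handle_endtag(self, tag):
        stdout.write("</{}>".format(tag))

    def handle_data(self, data):
        # Only the text between tags is changed; attribute values never reach this method.
        data = re.sub("100", "__replaced__", data)
        stdout.write(data)

with open(argv[1], "r") as f:
    parser = MyParser()
    parser.feed(f.read())
    parser.close()  # flush any text still buffered at the end of the input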
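To try this approach on the question's sample lines without creating a file, a variant that collects the output in a list and returns a string can be handy. This is only a sketch, assuming Python 3 (`html.parser`). `ReplacingParser` and `replace_in_text` are hypothetical names, and it makes no attempt to preserve comments, doctypes, entities, self-closing syntax or the original attribute quoting.

```python
from html.parser import HTMLParser


class ReplacingParser(HTMLParser):
    """Hypothetical parser that re-emits tags and rewrites only the text between them."""

    def __init__(self):
        super().__init__()
        self.out = []  # pieces of the rebuilt document

    def handle_starttag(self, tag, attrs):
        parts = ["<" + tag]
        for k, v in attrs:
            # Attributes without a value arrive with v set to None.
            parts.append(" " + k if v is None else ' {}="{}"'.format(k, v))
        parts.append(">")
        self.out.append("".join(parts))

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Attribute values never reach this method, so they stay untouched.
        self.out.append(data.replace("100", "__replaced__"))


def replace_in_text(html):
    parser = ReplacingParser()
    parser.feed(html)
    parser.close()  # flush text that is still buffered at the end of the input
    return "".join(parser.out)


print(replace_in_text('<span alt="tel:100">100</span>'))
```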

Upvotes: 4

Sobrique

Reputation: 53478

A warning first: parsing HTML with regular expressions is generally not a good idea. The usual answer is to use an HTML parser, and most scripting languages (Perl, Python, etc.) have one.

See here for an example of why: RegEx match open tags except XHTML self-contained tags

If you really must though:

/(?!\>)([^<>]+)(?=\<)/

DEMO
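To see what a pattern of this kind picks up, here is a small Python sketch. Note that it uses a lookbehind, `(?<=>)`, where the pattern above has `(?!\>)`, so a match is only accepted directly after a closing `>`. That substitution is an assumption about the intent, not part of the original pattern, and `element_texts` and `replace_texts` are hypothetical names.

```python
import re

# Text that directly follows a '>' and directly precedes a '<'.
# The lookbehind (?<=>) is an assumption about what the pattern is meant to do.
TEXT_NODE = re.compile(r"(?<=>)[^<>]+(?=<)")


def element_texts(html):
    """Hypothetical helper: list every run of text found between '>' and '<'."""
    return TEXT_NODE.findall(html)


def replace_texts(html, replacement="__replaced__"):
    """Hypothetical helper: replace every such run of text."""
    # A function is used as the replacement so backslashes are taken literally.
    return TEXT_NODE.sub(lambda m: replacement, html)


line = '<span alt="tel:100">100</span>'
print(element_texts(line))
print(replace_texts(line))
```

Whitespace between two adjacent tags (a newline between `</p>` and the next `<p>`, for instance) also counts as text here, which is one more reason to prefer a real parser for anything beyond one-line samples.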

Upvotes: 2

Avinash Raj

Reputation: 174706

You could try the PCRE regex below.

grep -oP '"[^"]*100[^"]*"(*SKIP)(*F)|\b100\b' file

or

grep -oP '"[^"]*"(*SKIP)(*F)|\b100\b' file

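This matches any 100 that is not inside double quotes. The first alternative consumes a complete double-quoted string and then discards it with (*SKIP)(*F), so only a 100 outside quotes can match the second alternative.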

DEMO
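Python's standard `re` module has no `(*SKIP)(*F)` verbs, but the same "consume the quoted part, keep only the rest" trick can be approximated with an alternation and a replacement function. This is a sketch under that assumption, not a translation of the PCRE pattern, and `replace_unquoted_100` is a made-up name.

```python
import re

# Alternative 1 swallows a whole double-quoted string.
# Alternative 2 captures a standalone 100 in group 1.
QUOTED_OR_100 = re.compile(r'"[^"]*"|\b(100)\b')


def replace_unquoted_100(line, replacement="__replaced__"):
    """Hypothetical helper: replace 100 everywhere except inside double quotes."""

    def keep_or_replace(match):
        # group(1) is set only when the second alternative matched.
        if match.group(1) is not None:
            return replacement
        return match.group(0)  # quoted string: put it back unchanged

    return QUOTED_OR_100.sub(keep_or_replace, line)


for line in ['<p alt="100">100</p>',
             '<img src="100.jpg">100</img>',
             '<span alt="tel:100">100</span>']:
    print(replace_unquoted_100(line))
```

Like the grep command, this treats any 100 outside quotes as a match, so a 100 in a trailing comment on the same line would be replaced as well.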

Upvotes: 1
