Tapas Bose

Reputation: 29806

sed: cut a string within a pattern

I have many XHTML files whose contents are like:

<h:panelGroup rendered="#{not accessBean.isUserLoggedIn}">
    <h:form>
        <p:panel style="margin-top:10px">
            <table style="margin:10px">
                <tbody>
                    <tr>
                        <td align="center">#{i.m['Login']}</td>
                        <td align="center">
                            <h:inputText value="#{accessBean.login}" />
                        </td>
                    </tr>
                    <tr>
                        <td align="center">#{i.m['Password']}</td>
                        <td align="center">
                            <h:inputSecret value="#{accessBean.password}" />
                        </td>
                    </tr>
                </tbody>
            </table>
            <p:commandButton ajax="false" value="#{i.m['Submit']}" action="#{accessBean.login}" />
        </p:panel>
    </h:form>
</h:panelGroup>

I want to replace every occurrence of #{i.m['<any-string>']} with <any-string>, i.e., keep only the string inside the pattern.

I wrote the following sed command:

sed -e "s/#{i.m\['\(.*\)']}/\1/g"

To run it recursively within a directory, I could execute:

find . -iname '*.xhtml' -type f -exec sed -i -e "s/#{i.m\['\(.*\)']}/\1/g" {} \;

Here, the any-string can be any human-readable character that HTML can display, i.e., letters, digits, punctuation and so on. That is why I used the regex (.*).

But it does not work in every case.

Here are some tests I made using echo:

  1. $ echo "<td align=\"center\">#{i.m['Login']}</td>" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"
    

    Result:

    <td align="center">Login</td>
    

    OK

  2. $ echo "<p:commandButton  ajax=\"false\" value=\"#{i.m['Submit']}\" action=\"#{accessBean.login}\" />" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"
    

    Result:

    <p:commandButton  ajax="false" value="Submit" action="#{accessBean.login}" />
    

    OK

  3. $ echo "<p:commandButton ajax=\"false\" value=\"#{i.m['Submit']}\" action=\"#{accessBean.login}\" /> <td align=\"center\">#{i.m['Login']}</td>" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"
    

    Result:

    <p:commandButton ajax="false" value="Submit']}" action="#{accessBean.login}" /> <td align="center">#{i.m['Login</td>
    

    NOK

I'm using Ubuntu 18.04.

Upvotes: 1

Views: 116

Answers (2)

David C. Rankin

Reputation: 84551

Per your request, and as noted in my comment and in the comments of others, you should definitely use a proper XML parser like xmlstarlet for XHTML. A simple regex does nothing to validate what is left behind.

That being said, for your example (only), to replace the text leaving Login, Password and Submit you could use the following regex:

sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/" <file

Whenever you have to match characters that can also be regex metacharacters, it helps to make explicit that the character is to be treated literally and not as part of the regex syntax. To do that, use a character class, e.g. [...], which matches any one of the characters between the brackets. (If the first character in the class is '^', the match is inverted, i.e. it matches any character except those in the class.)
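To illustrate the character-class point, here is a minimal sketch. The sample strings and the variable names (loose, strict, negated) are made up for the demonstration and are not taken from the question.

```shell
# '.' outside brackets matches any character, so "iXm" is rewritten too.
loose=$(printf 'i.m iXm\n' | sed 's/i.m/HIT/g')

# '[.]' matches only a literal dot, so "iXm" is left alone.
strict=$(printf 'i.m iXm\n' | sed 's/i[.]m/HIT/g')

# "[^']*" matches a run of characters that are not a single quote,
# so the match stops right before the quote.
negated=$(printf "a'b\n" | sed "s/[^']*/X/")

printf '%s\n%s\n%s\n' "$loose" "$strict" "$negated"
```

The first line printed should show both words replaced, the second only the literal "i.m", and the third should keep the quote and everything after it.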

With that explanation, the regex should become clear. The regex uses the basic substitution form:

sed "s/find/replace/" file

The 'find' REGEX

  • [#] - match the pound sign
  • [{] - match the opening brace
  • i - match the 'i'
  • [.] - explicitly match the '.' character (instead of . any character)
  • m - match the 'm'
  • [[] - match the opening bracket
  • ['] - match the single quote
  • \( - begin your capture group to capture text to reinsert as a back reference
  • [^']* - match zero-or-more characters that are not a single-quote
  • \) - end your capture group
  • ['] - match the single-quote as the next character
  • []] - match the closing bracket
  • [}] - match the closing brace.

The 'replace' REGEX

All characters captured as part of the find capture group (between the \(....\)), are available to use as a back reference in the replace portion of the substitution. You can have more than one capture group in the find portion, which you reference in the replace part of the substitution as \1, \2, ... and so on. Here you have only a single capture group in the find portion, so whatever was matched can be used as the entire replacement, e.g.

  • \1 - to replace the whole mess with just the text that was captured with [^']*
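One caveat worth sketching: the substitution as written has no g flag, so on a line that contains two #{i.m['...']} expressions only the first would be replaced. The sample line below is a shortened, hypothetical variant of the third test in the question, and the variable names are illustrative.

```shell
# A line with two occurrences of the pattern (shortened sample input).
line="<p:commandButton value=\"#{i.m['Submit']}\" /> <td>#{i.m['Login']}</td>"

# Without the g flag, sed replaces only the first match on each line.
first_only=$(printf '%s\n' "$line" |
  sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/")

# With the g flag, every match on the line is replaced.
all=$(printf '%s\n' "$line" |
  sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/g")

printf '%s\n%s\n' "$first_only" "$all"
```

For the XHTML file in the question this makes no difference, since each line holds at most one such expression, but adding g is probably the safer choice when running over many files.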

Example Use/Output

For use with your example, it will properly leave Login, Password and Submit as indicated in your question, e.g.

sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/" file
<h:panelGroup rendered="#{not accessBean.isUserLoggedIn}">
    <h:form>
        <p:panel style="margin-top:10px">
            <table style="margin:10px">
                <tbody>
                    <tr>
                        <td align="center">Login</td>
                        <td align="center">
                            <h:inputText value="#{accessBean.login}" />
                        </td>
                    </tr>
                    <tr>
                        <td align="center">Password</td>
                        <td align="center">
                            <h:inputSecret value="#{accessBean.password}" />
                        </td>
                    </tr>
                </tbody>
            </table>
            <p:commandButton ajax="false" value="Submit" action="#{accessBean.login}" />
        </p:panel>
    </h:form>
</h:panelGroup>

Again, as a disclaimer and just good common sense: don't parse X/HTML with a regex; use a proper tool like xmlstarlet. Don't parse JSON with a regex; use a proper tool for the job like jq. You get the drift. (For this limited example the regex works well, but it is fragile: if anything in the input changes, it will break, which is why we have tools like xmlstarlet and jq.)

Upvotes: 1

Michael Vehrs

Reputation: 3363

The problem here is that you do not take the greedy nature of regexps into account: .* matches as much as it can, so it runs on to the last ']} on the line. You need to prevent your regexp from gobbling up extra ' characters:

sed -e "s/#{i.m\['\([^']*\)']}/\1/g"

This is also the reason why David C. Rankin's solution works. His regexp is unnecessarily complex, however.
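As a rough sketch of the difference, the snippet below runs the greedy \(.*\) capture from the question and a \([^']*\) capture over the same line. The sample line is a shortened, hypothetical version of the failing third test, and the variable names are illustrative.

```shell
# Two occurrences on one line, as in the failing test from the question.
line="value=\"#{i.m['Submit']}\" <td>#{i.m['Login']}</td>"

# Greedy: .* runs from the first [' to the LAST ']} on the line,
# so both occurrences are swallowed into a single match.
greedy=$(printf '%s\n' "$line" | sed "s/#{i.m\['\(.*\)']}/\1/g")

# Bounded: [^']* cannot cross a single quote, so each occurrence
# is matched and replaced on its own.
bounded=$(printf '%s\n' "$line" | sed "s/#{i.m\['\([^']*\)']}/\1/g")

printf '%s\n%s\n' "$greedy" "$bounded"
```

The greedy form should leave the stray ']} and #{i.m[' fragments seen in the question's NOK result, while the bounded form yields Submit and Login cleanly. Note that this approach assumes the captured string never contains a single quote itself.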

Upvotes: 1
