sed: cut a string within a pattern

Question

I have many XHTML files whose contents are like:


    
        
            
                
                    
                        #{i.m['Login']}
                        
                            
                        
                    
                    
                        #{i.m['Password']}

I want to replace every occurrence of #{i.m['<any-string>']} with <any-string>, i.e., cut the string within the pattern.

I have created the following sed command

sed -e "s/#{i.m\['\(.*\)']}/\1/g"

And to run it recursively within a directory I could execute

find . -iname '*.xhtml' -type f -exec sed -i -e "s/#{i.m\['\(.*\)']}/\1/g" {} \;

~~Here, the any-string can be any human-readable HTML displayable character, i.e, alphabet, numbers, other characters etc. That's why I have used regex (.*).~~

But it does not seem to work correctly.

Here are some tests I made using echo:

$ echo "#{i.m['Login']}" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"

Result:

Login

OK

$ echo "" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"

Result:

OK

$ echo " #{i.m['Login']}" | sed -e "s/#{i.m\['\(.*\)']}/\1/g"

Result:

 #{i.m['Login

NOK

I'm using Ubuntu 18.04.

David C. Rankin · Accepted Answer

Per your request, and as noted in my comment and the comments of others, you should definitely use a proper XML parser like xmlstarlet for proper XHTML parsing. A simple regex has no validation for what is left behind.

That being said, for your example (only), to replace the text leaving LOGIN, PASSWORD and Submit you could use the following regex:

sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/"



Whenever you have to match characters that can also be part of the regex itself, it helps to explicitly make sure the character you want to match is treated as a character and not as part of the regex syntax. To do that you make use of a character-class (e.g. [...], where the characters between the brackets are matched). If the first character in the character class is '^', it inverts the match, i.e. it matches everything but what is in the class.
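As a small illustration of the character-class point, the sketch below compares an unescaped '.' with '[.]'. The inputs 'ixm' and 'i.m' and the replacement text MATCH are made up for the demonstration; they are not from the question's files.

```shell
# An unescaped '.' matches any single character, so "i.m" also matches "ixm".
loose=$(printf '%s\n' 'ixm' | sed 's/i.m/MATCH/')

# Inside a character class the '.' is literal, so "ixm" is left untouched...
strict=$(printf '%s\n' 'ixm' | sed 's/i[.]m/MATCH/')

# ...while a real "i.m" is still matched.
literal=$(printf '%s\n' 'i.m' | sed 's/i[.]m/MATCH/')

printf '%s\n' "$loose" "$strict" "$literal"
# prints:
# MATCH
# ixm
# MATCH
```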

With that explanation, the regex should become clear. The regex uses the basic substitution form:

sed "s/find/replace/" file


The 'find' REGEX


[#] - match the pound sign
[{] - match the opening brace
i - match the 'i'
[.] - explicitly match the '.' character (instead of . any character)
m - match the 'm'
[[] - match the opening bracket
['] - match the single quote
\( - begin your capture group to capture text to reinsert as a back reference
[^']* - match zero-or-more characters that are not a single-quote
\) - end your capture group
['] - match the single-quote as the next character
[]] - match the closing bracket
[}] - match the closing brace.


The 'replace' REGEX

All characters captured as part of the find capture group (between the \(...\)) are available to use as a back reference in the replace portion of the substitution. You can have more than one capture group in the find portion, which you reference in the replace part of the substitution as \1, \2, ... and so on. Here you have only a single capture group in the find portion, so whatever was matched can be used as the entire replacement, e.g.


\1 - to replace the whole mess with just the text that was captured with [^']*
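To see why the capture group uses [^']* rather than .*, here is a minimal sketch that runs both forms on one line containing two expressions. The sample line is invented for the demonstration. The trailing g flag is added here so that every occurrence on the line is replaced; the command above omits it because each line of the example has only one occurrence.

```shell
# Hypothetical input: two expressions on the same line.
line="a #{i.m['Login']} b #{i.m['Password']} c"

# Greedy .* runs from the first opening quote to the LAST closing ']}.
greedy=$(printf '%s\n' "$line" | sed "s/#{i.m\['\(.*\)']}/\1/g")

# [^']* stops at the first single quote, so each expression matches on its own.
bounded=$(printf '%s\n' "$line" | sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/g")

printf '%s\n' "$greedy" "$bounded"
# prints:
# a Login']} b #{i.m['Password c
# a Login b Password c
```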


Example Use/Output

For use with your example, it will properly leave Login, Password and Submit as indicated in your question, e.g.

sed "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/" file

    
        
            
                
                    
                        Login
                        
                            
                        
                    
                    
                        Password
                        
                            
                        
                    
                
            
            
        
    
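To run this over many files, as the question does with find, the same expression can be handed to sed -i. The following is only a sketch under a few assumptions: GNU sed (as shipped with Ubuntu) for the -i option, a throwaway directory created with mktemp, and a one-line hypothetical file login.xhtml whose label markup is invented for the demonstration. Trying it on a copy of the real files first is advisable, since -i overwrites them.

```shell
# Work in a scratch directory so no real file is modified.
demo_dir=$(mktemp -d)

# Hypothetical sample file; the <label> markup is made up for this demo.
printf '%s\n' "<label>#{i.m['Login']}</label>" > "$demo_dir/login.xhtml"

# GNU sed -i edits each file in place; the g flag replaces every
# occurrence on a line, and '{} +' passes the files to sed in batches.
find "$demo_dir" -iname '*.xhtml' -type f \
    -exec sed -i "s/[#][{]i[.]m[[][']\([^']*\)['][]][}]/\1/g" {} +

cat "$demo_dir/login.xhtml"
# prints: <label>Login</label>
```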



Again, as a disclaimer and just good common sense: don't parse X/HTML with a regex, use a proper tool like xmlstarlet. Don't parse JSON with a regex, use a proper tool for the job like jq. You get the drift. For this limited example the regex works well, but it is fragile: if anything in the input changes, it will break, which is why we have tools like xmlstarlet and jq.
