kieran_pli

Reputation: 13

Using a bat file, delete all but the first occurrences of a string in an XML file

I want to remove all occurrences of the strings <!-- and --> from an XML file EXCEPT for the first pair, which surrounds a comment that I want to keep. I do not want to delete any text enclosed by these strings. The strings all occur on separate lines. I am able to delete all instances of the strings by using the proposals in "Delete certain lines in a txt file via a batch file", but I am not sure of the best way (using a for loop?) to skip the first ones.

The XML looks like this:

<?xml version="1.0"?>
<!--
REVISION HISTORY and file descriptions which I want to keep commented
-->
<!--
some code I want to uncomment
-->
<!--
some more code I want to uncomment
-->

Upvotes: 1

Views: 704

Answers (2)

aschipfl

Reputation: 34899

The original answer is below; here is a much simpler approach, developed for the task at hand:

Here is a pure batch-file solution, based on the findstr command -- let us call it remove-lines.bat:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "FILE=%~1"           & rem // 1st argument is the original file
set "FILE_NEW=%~2"       & rem // 2nd argument is the modified file
set "SKIP_UNTIL=-->"     & rem // don't modify lines up to 1st occurrence
set REMOVE="<^!--" "-->" & rem // no `?` and `*` allowed here!
                           rem // `%` --> `%%` & `!`  --> `^!`

if defined FILE (set FILE="%FILE%") else set "FILE="
if not defined FILE_NEW set "FILE_NEW=con"

> "%FILE_NEW%" (
    set "FLAG="
    for /F "delims=" %%L in ('findstr /N /R "^" %FILE%') do (
        set "LINE=%%L"
        setlocal EnableDelayedExpansion
        set "LINE=!LINE:*:=!"
        if defined FLAG (
            set "FOUND="
            for %%S in (!REMOVE!) do (
                echo(| set /P "=_!LINE!" | > nul findstr /L /M /C:"_%%S"
                if not ErrorLevel 1 set "FOUND=#"
            )
            if not defined FOUND echo(!LINE!
        ) else (
            echo(!LINE!
        )
        echo(| set /P "=_!LINE!" | > nul findstr /L /M /C:"_!SKIP_UNTIL!"
        if ErrorLevel 1 (endlocal) else endlocal & set "FLAG=#"
    )
)

endlocal
exit /B

Basically, the script reads the text file with the for /F %%L loop (see note 1). In the body of this loop, a standard for %%S loop iterates through the strings defined by the variable REMOVE. Inside that loop, the variable FOUND is set as soon as any one of the strings has been found in the current line (see note 2). After the loop, the line is returned only if FOUND is still empty, meaning that none of the strings has been found. All this searching is done only if the variable FLAG is set, which happens as soon as the string in the variable SKIP_UNTIL is encountered for the first time (see note 2). Since this search is done after the check of FLAG, the inner loop does not process the affected line itself. Every line read is returned unedited as long as FLAG is unset.

1) Such a loop ignores empty lines; to overcome that, the findstr command temporarily precedes every line with a line number, which is later removed in the body of the loop; this way empty lines are not lost.
2) If you want to force the search string to occur at the beginning or at the end of the current line, add the respective switch /B or /E to the findstr command; to force the entire line to match the search string, add the /X switch.
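
For readers who want to check the FLAG/FOUND logic without a Windows console, here is a minimal sketch of the same line filter in plain JavaScript. The function name removeLinesAfterFirst and the sample array are made up for illustration; the sketch assumes plain substring matching and is not a translation of every detail of the batch script.

```javascript
// Hypothetical helper (not part of remove-lines.bat): mimics the FLAG/FOUND logic.
function removeLinesAfterFirst(lines, skipUntil, removeStrings) {
    var out = [];
    var flag = false; // set once skipUntil has been seen for the first time
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        var found = false;
        if (flag) {
            // only search for the strings to remove after the flag is set
            for (var j = 0; j < removeStrings.length; j++) {
                if (line.indexOf(removeStrings[j]) !== -1) { found = true; }
            }
        }
        if (!found) { out.push(line); }
        // checked after the output step, so the line that sets the flag is kept
        if (line.indexOf(skipUntil) !== -1) { flag = true; }
    }
    return out;
}

var sample = [
    '<?xml version="1.0"?>',
    '<!--',
    'REVISION HISTORY',
    '-->',
    '<!--',
    'some code',
    '-->'
];
console.log(removeLinesAfterFirst(sample, '-->', ['<!--', '-->']).join('\n'));
```

With the sample above, the first comment keeps its delimiters and the later delimiter lines are dropped, while the text between them survives.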


To use it for an XML file, say data.xml in the current directory, and to write the result into file data_new.xml at the same location, type the following command line:

"remove-lines.bat" "data.xml" "data_new.xml"

This is the original answer. It describes a rather complicated approach with two scripts, one calling the other; it was done that way because the first (sub-)script was already available (although it had been developed for something completely different):

Here is a pure batch-file solution, based on a simple but quite flexible search-and-replace script -- let us call it search+replace.bat:

@echo off
setlocal DisableDelayedExpansion

rem /* Define pairs of search/replace strings here, separated by spaces,
rem    each one in the format `"<search_string>=<replace_string>"`;
rem    the `""` are mandatory; `=` separates search from replace string;
rem    the replace string may be empty, but the search string must not;
rem    if the `=` is omitted, the whole string is taken as search string;
rem    both strings must not contain the characters `=`, `*`, `?` and `"`;
rem    the search string must not begin with `~`;
rem    exclamation marks must be escaped like `^!`;
rem    percent signs must be doubled like `%%`;
rem    the search is done in a case-insensitive manner;
rem    the replacements are done in the given order: */
set STRINGS="<^!--=" "-->="

set "FILE=%~1"
rem // provide a file by command line argument;
rem // if none is given, the console input is taken;
if defined FILE (set FILE="%FILE%") else set "FILE="

set "SKIP=%~2"
rem // provide number of lines to skip optionally;
set /A SKIP+=0

for /F "delims=" %%L in ('findstr /N /R "^" %FILE%') do (
    set "LINE=%%L"
    for /F "delims=:" %%N in ("%%L") do set "LNUM=%%N"
    setlocal EnableDelayedExpansion
    set "LINE=!LINE:*:=!"
    if !LNUM! GTR %SKIP% (
        for %%R in (!STRINGS!) do (
            if defined LINE (
                for /F "tokens=1,2 delims== eol==" %%S in ("%%~R") do (
                    set "LINE=!LINE:%%S=%%T!"
                )
            )
        )
    )
    echo(!LINE!
    endlocal
)

endlocal
exit /B

Basically, the script reads the text file with the for /F %%L loop (see note 3). In the body of this loop, a standard for %%R loop iterates through the search/replace string pairs defined by the variable STRINGS. Inside that loop, each string pair is split into search and replace strings by another for /F %%S loop (see note 4). The actual string replacement is done using the standard sub-string replacement syntax -- type set /? for details.

3) Such a loop ignores empty lines; to overcome that, the findstr command temporarily precedes every line with a line number, which is later removed in the body of the loop; this way empty lines are not lost.
4) This splits the pair at the (first) = sign; the two parts are then put together again with an = sign in between. This is usually not necessary, but it is done to avoid trouble when no = sign is given.
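
As a rough illustration of the skip-then-replace idea, the following JavaScript sketch applies "search=replace" pairs to every line after the first skip lines. The names searchReplaceLines and escapeRegExp are hypothetical; the sketch assumes case-insensitive literal matching (as the batch sub-string replacement does) and does not reproduce the batch script's restrictions on special characters.

```javascript
// Escape regex metacharacters so the search string is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: lines is an array of strings, pairs is an array of
// "search=replace" strings, skip is the number of leading lines left untouched.
function searchReplaceLines(lines, pairs, skip) {
    var out = [];
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        if (i + 1 > skip) { // line numbers are 1-based, like findstr /N
            for (var p = 0; p < pairs.length; p++) {
                var eq = pairs[p].indexOf('=');
                var search = eq === -1 ? pairs[p] : pairs[p].slice(0, eq);
                var replacement = eq === -1 ? '' : pairs[p].slice(eq + 1);
                // a function replacement avoids special handling of `$` sequences
                line = line.replace(new RegExp(escapeRegExp(search), 'gi'), function () {
                    return replacement;
                });
            }
        }
        out.push(line);
    }
    return out;
}

console.log(searchReplaceLines(['<!-- a -->', '<!-- b -->'], ['<!--=', '-->='], 1).join('\n'));
```

The first line is skipped and printed unchanged; the second has both delimiters replaced by empty strings.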


The STRINGS variable is already adapted to your needs, that is, to remove the literal strings <!-- and --> (or, in other words, to replace them by empty strings) -- see the related remark at the top of the script.

To use it for an XML file, say data.xml in the current directory, type the following command line:

"search+replace.bat" "data.xml" 0

The resulting text is written to the console window. To put it into a file, use redirection:

("search+replace.bat" "data.xml" 0)> "data_new.xml"

Note that you must not specify the same file for both input and output.

The 0 (can be omitted) is an optional argument that specifies how many lines from the beginning should be excluded from being processed. These lines are returned unedited.


Removing strings from a text file may result in several empty lines, like for your sample XML data. To get rid of them, you could use the following command line (entered into command prompt):

(for /F delims^=^ eol^= %F in ('^""search+replace.bat" "data.xml" 0^"') do @echo(%F) > "data_new.xml"

To use this code snippet in a batch file, you need to double the % signs (%F becomes %%F).


Since you want to keep the first <!-- / --> comment (and there are no multiple comments within a single line, according to your sample data), you could use the following script. It determines the number of the first line in data.xml containing -->, calls search+replace.bat with the file and that line number as arguments, captures the output of that script, removes any empty lines and writes the result to the new file data_new.xml:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "FILE=data.xml"
set "FILE_NEW=data_new.xml"
set "SEEK_TEXT=-->"
set "FIRST=#" &rem (set to empty string for last occurrence)

rem // Search for the first (or last) occurrence of `%SEEK_TEXT%`:
set /A LINE_NUM=0
for /F "delims=:" %%N in ('
    findstr /N /L /C:"%SEEK_TEXT%" "%FILE%"
') do (
    set "LINE_NUM=%%N"
    if defined FIRST goto :CONTINUE
)
:CONTINUE

rem // Call sub-script to search and replace (remove) strings,
rem // remove all empty lines and write result to new file:
(
    for /F delims^=^ eol^= %%F in ('
        ^""%~dp0search+replace.bat" "%FILE%" %LINE_NUM%^"
    ') do (
        echo(%%F
    )
) > "%FILE_NEW%"

endlocal
exit /B
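
The three steps of this wrapper (find the first line containing -->, strip the delimiters from all later lines, drop empty lines) can be sketched in plain JavaScript as follows. The function names firstLineContaining and uncommentAfterFirst are invented for this illustration, and the sketch assumes at most one comment delimiter pair per line, as in the sample data.

```javascript
// Hypothetical helper: 1-based number of the first line containing text, or 0 if none.
function firstLineContaining(lines, text) {
    for (var i = 0; i < lines.length; i++) {
        if (lines[i].indexOf(text) !== -1) { return i + 1; }
    }
    return 0;
}

// Hypothetical helper: keeps everything up to the first "-->" line unedited,
// strips "<!--" and "-->" from later lines and drops lines that end up empty.
function uncommentAfterFirst(lines) {
    var skip = firstLineContaining(lines, '-->');
    var out = [];
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        if (i + 1 > skip) {
            line = line.split('<!--').join('').split('-->').join('');
        }
        if (line !== '') { out.push(line); }
    }
    return out;
}

console.log(uncommentAfterFirst([
    '<?xml version="1.0"?>',
    '<!--',
    'REVISION HISTORY',
    '-->',
    '<!--',
    'some code',
    '-->'
]).join('\n'));
```

If no line contains -->, the line number is 0 and every line is processed, which matches the LINE_NUM=0 default in the batch script.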

Upvotes: 0

rojo

Reputation: 24466

The best way of handling any structured markup language (XML, HTML, JSON, etc) is to parse it with the appropriate interpreter. Hacking and scraping as flat text is inviting trouble if the formatting ever changes. Save this with a .bat extension and give it a shot.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

set "infile=test.xml"
set "outfile=test.xml"
cscript /nologo /e:Jscript "%~f0" "%infile%" "%outfile%" && echo Done.

goto :EOF
@end // end batch / begin JScript

var DOM = WSH.CreateObject('Msxml2.DOMDocument.6.0'),
    args = { load: WSH.Arguments(0), save: WSH.Arguments(1) };

// Disable async before loading, so that load() finishes before parseError is checked.
DOM.async = false;
DOM.load(args.load);

// sanity check the XML
if (DOM.parseError.errorCode) {
    var e = DOM.parseError;
    WSH.StdErr.WriteLine('Error in ' + args.load + ' line ' + e.line + ' char '
        + e.linepos + ':\n' + e.reason + '\n' + e.srcText);
    WSH.Quit(1);
}

var comments = DOM.documentElement.selectNodes('//comment()');

// This will delete all but the first comment.
for (var i = comments.length - 1; i > 0; i--) {
    comments[i].parentNode.removeChild(comments[i]);
}
DOM.save(args.save);

Edit: I guess if you're working with invalid XML, then manipulating the text as flat text is probably the best solution. Here's a modified version that does this:

@if (@CodeSection == @Batch) @then

@echo off
setlocal

set "infile=test.xml"
set "outfile=test2.xml"
cscript /nologo /e:Jscript "%~f0" "%infile%" "%outfile%" && echo Done.

goto :EOF
@end // end batch / begin JScript

var args = { load: WSH.Arguments(0), save: WSH.Arguments(1) },
    fso = WSH.CreateObject('Scripting.FileSystemObject'),
    fHand = fso.OpenTextFile(args.load, 1),
    matches = 0,
    XML = fHand.ReadAll().replace(/<!--|-->/g, function(m) {
        return (matches++ > 1) ? '' : m;
    });

fHand.Close();
fHand = fso.CreateTextFile(args.save, true);
fHand.Write(XML);
fHand.Close();
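
To try the counting callback outside Windows Script Host, the same replace logic can be isolated in a small function. The name keepFirstCommentDelimiters is made up here; the sketch assumes, like the script above, that the first two matches are the opening and closing delimiter of the comment to keep.

```javascript
// Hypothetical helper: keeps the first two delimiter matches, removes all later ones.
function keepFirstCommentDelimiters(text) {
    var matches = 0;
    return text.replace(/<!--|-->/g, function (m) {
        // matches 0 and 1 are the first "<!--" and "-->"; everything after is dropped
        return (matches++ > 1) ? '' : m;
    });
}

console.log(keepFirstCommentDelimiters('<!-- a -->\n<!-- b -->\n<!-- c -->'));
```

Only the delimiters are removed, so the text that was commented out stays in place, including its surrounding whitespace.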

Or if you prefer PowerShell, here's a Batch + PowerShell hybrid script that does the same thing using the same logic.

<# : batch portion

@echo off
setlocal

set "infile=test.xml"
set "outfile=test2.xml"
powershell "iex (${%~f0} | out-string)" && echo Done.

goto :EOF
: end Batch / begin PowerShell hybrid code #>

[regex]::replace(
    (gc $env:infile | out-string),
    "<!--|-->",
    {
        if ($matches++ -gt 1) {
            ""
        } else {
            $args[0].Value
        }
    }
) | out-file $env:outfile -force

Upvotes: 1
