Carlos Escalera Alonso

Reputation: 2363

Get all strings between a specific tag in an unformatted XML file with a batch file

I'm trying to get the strings between two tags in an XML file by adapting a solution I found here.

This is the batch file I have:

@echo off
setlocal EnableDelayedExpansion

(for /F "delims=" %%a in ('findstr /I /L "<Name>" contacts.xml') do (
   set "line=%%a"
   set "line=!line:*<Name>=!"
   for /F "delims=<" %%b in ("!line!") do echo %%b
)) > list.txt

When the XML is formatted, I get all the names:

<List>
   <Contacts>
      <Row>
         <Name>Carlos</Name>
         <Path>\Some\path\1</Path>
         <Hidden>False</Hidden>
      </Row>
      <Row>
         <Name>Fernando</Name>
         <Path>\Some\path\2</Path>
         <Hidden>False</Hidden>
      </Row>
      <Row>
         <Name>Luis</Name>
         <Path>\Some\path\3</Path>
         <Hidden>False</Hidden>
      </Row>
      <Row>
         <Name>Daniel</Name>
         <Path>\Some\path\4</Path>
         <Hidden>False</Hidden>
      </Row>
   </Contacts>
</List>

Carlos

Fernando

Luis

Daniel

But when the XML is on a single line (which is how it is generated), I only get the first name:

<List><Contacts><Row><Name>Carlos</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Fernando</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Luis</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Daniel</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row></Contacts></List>

Carlos

What changes should I make to the batch file so that it correctly parses unformatted XML files?

Upvotes: 1

Views: 2727

Answers (3)

Aacini

Reputation: 67216

Batch files are strongly tied to the format of the data they process. If the data format changes, a new batch file is usually required. The pure batch file below extracts the names from your example unformatted XML file, as long as the line is shorter than 8190 characters.

@echo off
setlocal EnableDelayedExpansion

for /F "delims=" %%a in (contacts.xml) do (
   set "line=%%a"
   for %%X in (^"^
% Do NOT remove this line %
^") do for /F "delims=" %%b in ("!line:>=%%~X!") do (
      if /I "!field!" equ "<Name" for /F "delims=<" %%c in ("%%b") do echo %%c
      set "field=%%b"
   )
)

EDIT: Some explanations added

This solution uses an interesting trick that consists of replacing a character in a string with a line feed (ASCII 10) character and then passing the result to a for /F command. This way, the parts of the original string delimited by that character are processed as individual lines.

This is the simplest example of such a method:

@echo off
setlocal EnableDelayedExpansion

set "line=Line one|Line two|Line three|Line four"

for %%X in (^"^
% Do NOT remove this line %
^") do for /F "delims=" %%b in ("!line:|=%%~X!") do echo %%b

The first for %%X is a way to assign a line feed character to the %%X replaceable parameter. After that, the !line:|=%%~X! part replaces each | character with a line feed. Finally, the second for /F command processes the resulting lines in the usual way.

Upvotes: 2

dbenham

Reputation: 130819

As Adriano implied in his comment, parsing XML via a powerful tool like regular expressions is frowned upon. Parsing XML with batch is far worse.

Pure, native batch cannot work with lines of text longer than 8191 bytes unless you use extraordinary techniques involving the FC command - trust me, you don't want to go there. There is no reason to expect an XML file to be smaller than 8191 bytes, so the short answer is essentially - you cannot parse unformatted XML that exists as one continuous line using native batch commands.
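As a rough illustration of that limit, a batch file could check the file size before attempting a pure batch approach. This is only a sketch: it assumes the file is named contacts.xml as in the question, and that the whole file is one line, so the file size approximates the line length.

```batch
@echo off
setlocal

rem Hypothetical guard: the file name comes from the question.
if not exist "contacts.xml" (
   echo contacts.xml not found
   exit /b 1
)

rem %%~zF expands to the size of the file in bytes.
for %%F in ("contacts.xml") do (
   if %%~zF geq 8191 (
      echo contacts.xml is %%~zF bytes, too long to handle as one line in pure batch
      exit /b 1
   )
)

echo contacts.xml is small enough to try a pure batch approach
```

The check is conservative for multi-line files, since it compares the total size rather than the longest line.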

I have written a script-based regular expression utility for batch called JREPL.BAT. It is a hybrid JScript/batch script that runs natively on any Windows machine from XP onward. I recommend putting JREPL.BAT in a folder (I use c:\utils) and then including that folder in your PATH variable.

The following JREPL.BAT command can be used to parse out your names under most simple scenarios, assuming you never have nested <Name> elements. But like any regular expression "solution", this code is not robust for all situations.

jrepl "<Name>([\s\S]*?)</Name>" "$1" /m /jmatch /f "contacts.xml" /o "list.txt"

Since JREPL is a batch script, you must use CALL JREPL if you want to use the command within another batch script.
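For instance, a wrapper batch file might look like the sketch below. It assumes JREPL.BAT is reachable through PATH and that contacts.xml is in the current folder. The final TYPE line is only there to show the result.

```batch
@echo off
rem Without CALL, control would transfer to JREPL.BAT and never return to this script.
call jrepl "<Name>([\s\S]*?)</Name>" "$1" /m /jmatch /f "contacts.xml" /o "list.txt"

rem Execution continues here because CALL was used.
type "list.txt"
```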

Upvotes: 4

rojo

Reputation: 24466

Before I answer, I should point out that your single-line XML is missing a </Row> close tag, and all <Name> elements contain Carlos. So, in testing my answer, I used the following XML:

<List><Contacts><Row><Name>Carlos</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Fernando</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Luis</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Daniel</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row></Contacts></List>

Whenever you're manipulating or extracting data from XML or HTML, I think it's generally preferable to parse it as XML or HTML, rather than trying to scrape bits of text from it. Regardless of whether your XML is beautified or minified, if you parse XML as XML, your code still works. The same can't be said for regexp or token searches.

Pure batch doesn't handle XML all that well. But Windows Script Host does. Your best bet would be to employ JScript or VBScript, or possibly PowerShell. My solution is a batch + JScript hybrid script, employing the Microsoft.XMLDOM COM object and an XPath query to select the text child nodes of all the <Name> nodes -- basically, selectNodes('//Name/text()').

Save this with a .bat extension and salt to taste.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

set "xmlfile=test.xml"

for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%xmlfile%"') do (
    echo Name: %%~I
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

var DOM = WSH.CreateObject('Microsoft.XMLDOM');

with (DOM) {
    // disable async before load() so parsing finishes before parseError is checked
    async = false;
    setProperty('SelectionLanguage', 'XPath');
    load(WSH.Arguments(0));
}

if (DOM.parseError.errorCode) {
   WSH.Echo(DOM.parseError.reason);
   WSH.Quit(1);
}

for (var d = DOM.documentElement.selectNodes('//Name/text()'), i = 0; i < d.length; i++) {
    WSH.Echo(d[i].data);
}

Upvotes: 3
