Reputation: 2363
I'm trying to get the strings between two tags in an XML file by adapting a solution I found here.
This is the batch file I have:
@echo off
setlocal EnableDelayedExpansion
(for /F "delims=" %%a in ('findstr /I /L "<Name>" contacts.xml') do (
set "line=%%a"
set "line=!line:*<Name>=!"
for /F "delims=<" %%b in ("!line!") do echo %%b
)) > list.txt
When the XML is formatted, I get all the names:
<List>
<Contacts>
<Row>
<Name>Carlos</Name>
<Path>\Some\path\1</Path>
<Hidden>False</Hidden>
</Row>
<Row>
<Name>Fernando</Name>
<Path>\Some\path\2</Path>
<Hidden>False</Hidden>
</Row>
<Row>
<Name>Luis</Name>
<Path>\Some\path\3</Path>
<Hidden>False</Hidden>
</Row>
<Row>
<Name>Daniel</Name>
<Path>\Some\path\4</Path>
<Hidden>False</Hidden>
</Row>
</Contacts>
</List>
Carlos
Fernando
Luis
Daniel
But when the XML is on a single line (which is how it is generated), I only get the first name:
<List><Contacts><Row><Name>Carlos</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Fernando</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Luis</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Daniel</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row></Contacts></List>
Carlos
What changes should I make to the batch file so that it correctly parses unformatted XML files?
Upvotes: 1
Views: 2727
Reputation: 67216
Batch files are strongly tied to the format of the data they process. If the data format changes, a new batch file is usually required. The pure batch file below extracts the names from your example unformatted XML file, as long as the line is shorter than 8190 characters.
@echo off
setlocal EnableDelayedExpansion
for /F "delims=" %%a in (contacts.xml) do (
set "line=%%a"
for %%X in (^"^
% Do NOT remove this line %
^") do for /F "delims=" %%b in ("!line:>=%%~X!") do (
if /I "!field!" equ "<Name" for /F "delims=<" %%c in ("%%b") do echo %%c
set "field=%%b"
)
)
EDIT: Some explanations added
This solution uses an interesting trick that consists of replacing a character in a string with a line feed (ASCII 10) character and then passing the result to a for /F command. In this way, the parts of the original string delimited by that character are processed as individual lines.
This is the simplest example of such a method:
@echo off
setlocal EnableDelayedExpansion
set "line=Line one|Line two|Line three|Line four"
for %%X in (^"^
% Do NOT remove this line %
^") do for /F "delims=" %%b in ("!line:|=%%~X!") do echo %%b
The first for %%X is the way to assign a line feed character to the %%X replaceable parameter. After that, the !line:|=%%~X! part replaces each | character with a line feed. Finally, the second for /F command processes the resulting lines in the usual way.
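For readers who find the batch syntax hard to follow, the same idea can be sketched in JavaScript. This is only an illustration of the technique, not a replacement for the batch file: extractNames is a hypothetical helper name, and the sketch only roughly mirrors how for /F tokenizes its input.

```javascript
// Rough sketch of the split-on-delimiter idea used by the batch file.
// extractNames is a hypothetical name chosen for this illustration.
function extractNames(xmlLine) {
  const names = [];
  let field = ""; // previous segment, playing the role of !field!

  // Replace every ">" with a line feed, then walk the resulting "lines",
  // which is what "!line:>=%%~X!" feeds into for /F.
  const segments = xmlLine.replace(/>/g, "\n").split("\n");

  for (const segment of segments) {
    if (segment === "") continue; // for /F also skips empty lines
    if (field.toLowerCase() === "<name") {
      // Keep only the text before the next "<", like for /F "delims=<".
      names.push(segment.split("<")[0]);
    }
    field = segment;
  }
  return names;
}

const sample =
  "<List><Contacts><Row><Name>Carlos</Name><Hidden>False</Hidden></Row>" +
  "<Row><Name>Fernando</Name><Hidden>False</Hidden></Row></Contacts></List>";
console.log(extractNames(sample).join("\n"));
```

The key point is that each segment is examined together with the one before it: a segment is a name only when the previous segment was the opening <Name tag.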
Upvotes: 2
Reputation: 130819
As Adriano implied in his comment, parsing XML via a powerful tool like regular expressions is frowned upon. Parsing XML with batch is far worse.
Pure, native batch cannot work with lines of text longer than 8191 bytes unless you use extraordinary techniques involving the FC command - trust me, you don't want to go there. There is no reason to expect an XML file to be smaller than 8191 bytes, so the short answer is essentially - you cannot parse unformatted XML that exists as one continuous line using native batch commands.
I have written a script-based regular expression utility for batch called JREPL.BAT. It is a hybrid JScript/batch script that runs natively on any Windows machine from XP onward. I recommend putting JREPL.BAT in a folder (I use c:\utils) and then including that folder in your PATH variable.
The following JREPL.BAT command can be used to parse out your names under most simple scenarios, assuming you never have nested <Name> elements. But like any regular expression "solution", this code is not robust for all situations.
jrepl "<Name>([\s\S]*?)</Name>" "$1" /m /jmatch /f "contacts.xml" /o "list.txt"
Since JREPL is a batch script, you must use CALL JREPL if you want to use the command within another batch script.
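To see what the regular expression itself does, independent of JREPL, here is a minimal JavaScript sketch that applies the same pattern to a string. The function name namesByRegex is hypothetical, and the same caveat applies: this is a regex scrape, not an XML parse.

```javascript
// namesByRegex is a hypothetical helper for illustration only.
// The pattern is the same one passed to JREPL above.
function namesByRegex(xml) {
  // [\s\S]*? is lazy, so each match stops at the nearest </Name>
  // instead of running on to the last one in the line.
  const pattern = /<Name>([\s\S]*?)<\/Name>/g;
  const names = [];
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    names.push(match[1]); // capture group 1, the "$1" in the JREPL call
  }
  return names;
}

const oneLine =
  "<List><Contacts><Row><Name>Carlos</Name></Row>" +
  "<Row><Name>Fernando</Name></Row></Contacts></List>";
console.log(namesByRegex(oneLine).join("\n"));
```

With a greedy `*` in place of `*?`, the single-line input would produce one match spanning from the first <Name> to the last </Name>, which is why the lazy quantifier matters here.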
Upvotes: 4
Reputation: 24466
Before I answer, I should point out that your single-line XML is missing a </Row> close tag, and all <Name> elements contain Carlos. So, in testing my answer, I used the following XML:
<List><Contacts><Row><Name>Carlos</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Fernando</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Luis</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Daniel</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row></Contacts></List>
Whenever you're manipulating or extracting data from XML or HTML, I think it's generally preferable to parse it as XML or HTML, rather than trying to scrape bits of text from it. Regardless of whether your XML is beautified or minified, if you parse XML as XML, your code still works. The same can't be said for regexp or token searches.
Pure batch doesn't handle XML all that well. But Windows Script Host does. Your best bet would be to employ JScript or VBScript, or possibly PowerShell. My solution is a batch + JScript hybrid script, employing the Microsoft.XMLDOM COM object and an XPath query to select the text child nodes of all the <Name> nodes -- basically, selectNodes('//Name/text()').
Save this with a .bat extension and salt to taste.
@if (@CodeSection == @Batch) @then
@echo off
setlocal
set "xmlfile=test.xml"
for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%xmlfile%"') do (
echo Name: %%~I
)
rem // end main runtime
goto :EOF
@end
// end batch / begin JScript chimera
var DOM = WSH.CreateObject('Microsoft.XMLDOM');
with (DOM) {
// set async before load() so the document is fully loaded before parseError is checked
async = false;
load(WSH.Arguments(0));
setProperty('SelectionLanguage', 'XPath');
}
if (DOM.parseError.errorCode) {
WSH.Echo(DOM.parseError.reason);
WSH.Quit(1);
}
for (var d = DOM.documentElement.selectNodes('//Name/text()'), i = 0; i < d.length; i++) {
WSH.Echo(d[i].data);
}
Upvotes: 3