Reputation: 287690
I have a lot of XML files and I'd like to generate a report from them. The report should provide information such as:
root 100%
  a*1 90%
  b*1 80%
    c*5 40%
meaning that all documents have a root element, 90% have one a element in the root, 80% have one b element in the root, 40% have 5 c elements in b.
If for example some documents have 4 c elements, some 5 and some 6, it should say something like:
c*4.3 4 6 40%
meaning that 40% have between 4 and 6 c elements there, and the average is 4.3.
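A minimal Ruby sketch of that line format, in case it helps pin down the spec. The method name report_line and its parameters are made up for illustration, and it assumes the percentage is taken over all documents examined:

```ruby
# Format one report line from the per-document counts of an element.
# counts:     one entry per document that contains the element
# total_docs: number of documents examined (hypothetical parameter)
def report_line(name, counts, total_docs)
  percent = (counts.length * 100.0 / total_docs).round
  if counts.uniq.length == 1
    # Every document has the same count: print it as-is.
    "#{name}*#{counts.first} #{percent}%"
  else
    # Counts differ: print average, minimum and maximum.
    avg = counts.sum.to_f / counts.length
    format("%s*%.1f %d %d %d%%", name, avg, counts.min, counts.max, percent)
  end
end

puts report_line("a", [1] * 9, 10)           # a*1 90%
puts report_line("c", [4] * 8 + [5, 6], 25)  # c*4.3 4 6 40%
```

The second call models ten of 25 documents holding between 4 and 6 c elements, which gives the average of 4.3 used in the example.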
I am looking for free software; if it doesn't exist, I'll write it. I was about to start, but I thought I'd check first. I may not be the first one who has had to analyze thousands of XML files and get a structural overview of them.
Upvotes: 3
Views: 799
Reputation: 1327004
Here is a possible solution in ruby to this code-challenge...
Since it is my very first ruby program, I am sure it is quite terribly coded, but at least it may answer J. Pablo Fernandez's question.
Copy-paste it into a '.rb' file and run ruby on it. If you have an Internet connection, it will work ;)
require "rexml/document"
require "net/http"
require "iconv"
include REXML
class NodeAnalyzer
@@fullPathToFilesToSubNodesNamesToCardinalities = Hash.new()
@@fullPathsToFiles = Hash.new() #list of files in which a fullPath node is detected
@@fullPaths = Array.new # all fullpaths sorted alphabetically
attr_reader :name, :father, :subNodesAnalyzers, :indent, :file, :subNodesNamesToCardinalities
def initialize(aName="", aFather=nil, aFile="")
@name = aName; @father = aFather; @subNodesAnalyzers = []; @file = aFile
@subNodesNamesToCardinalities = Hash.new(0)
if aFather && !aFather.name.empty? then @indent = " " else @indent = "" end
if aFather
@indent = @father.indent + self.indent
@father.subNodesAnalyzers << self
@father.updateSubNodesNamesToCardinalities(@name)
end
end
@@nodesRootAnalyzer = NodeAnalyzer.new
def NodeAnalyzer.nodesRootAnalyzer
return @@nodesRootAnalyzer
end
def updateSubNodesNamesToCardinalities(aSubNodeName)
aSubNodeCardinality = @subNodesNamesToCardinalities[aSubNodeName]
@subNodesNamesToCardinalities[aSubNodeName] = aSubNodeCardinality + 1
end
def NodeAnalyzer.recordNode(aNodeAnalyzer)
if aNodeAnalyzer.fullNodePath.empty? == false
if @@fullPaths.include?(aNodeAnalyzer.fullNodePath) == false then @@fullPaths << aNodeAnalyzer.fullNodePath end
# record a full path against its xml file (recorded only once for a given xml file)
someFiles = @@fullPathsToFiles[aNodeAnalyzer.fullNodePath]
if someFiles == nil
someFiles = Array.new(); @@fullPathsToFiles[aNodeAnalyzer.fullNodePath] = someFiles;
end
if !someFiles.include?(aNodeAnalyzer.file) then someFiles << aNodeAnalyzer.file end
end
# record cardinalities of sub nodes for a given xml file
someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath]
if someFilesToSubNodesNamesToCardinalities == nil
someFilesToSubNodesNamesToCardinalities = Hash.new(); @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath] = someFilesToSubNodesNamesToCardinalities ;
end
someSubNodesNamesToCardinalities = someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file]
if someSubNodesNamesToCardinalities == nil
someSubNodesNamesToCardinalities = Hash.new(0); someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file] = someSubNodesNamesToCardinalities
someSubNodesNamesToCardinalities.update(aNodeAnalyzer.subNodesNamesToCardinalities)
else
aNodeAnalyzer.subNodesNamesToCardinalities.each() do |aSubNodeName, aCardinality|
someSubNodesNamesToCardinalities[aSubNodeName] = someSubNodesNamesToCardinalities[aSubNodeName] + aCardinality
end
end
#puts "someSubNodesNamesToCardinalities for #{aNodeAnalyzer.fullNodePath}: #{someSubNodesNamesToCardinalities}"
end
def file
#if @file.empty? then @father.file else return @file end
if @file.empty? then if @father != nil then return @father.file else return '' end else return @file end
end
def fullNodePath
if @father == nil then return '' elsif @father.name.empty? then return @name else return @father.fullNodePath+"/"+@name end
end
def to_s
s = ""
if @name.empty? == false
s = "#{@indent}#{self.fullNodePath} [#{self.file}]\n"
end
@subNodesAnalyzers.each() do |aSubNodeAnalyzer|
s = s + aSubNodeAnalyzer.to_s
end
return s
end
def NodeAnalyzer.displayStats(aFullPath="")
s = "";
if aFullPath.empty? then s = "Statistical Elements Analysis of #{@@nodesRootAnalyzer.subNodesAnalyzers.length} xml documents with #{@@fullPaths.length} elements\n" end
someFullPaths = @@fullPaths.sort
someFullPaths.each do |aFullPath|
s = s + getIndentedNameFromFullPath(aFullPath) + "*"
nbFilesWithThatFullPath = getNbFilesWithThatFullPath(aFullPath);
aParentFullPath = getParentFullPath(aFullPath)
nbFilesWithParentFullPath = getNbFilesWithThatFullPath(aParentFullPath);
aNameFromFullPath = getNameFromFullPath(aFullPath)
someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aParentFullPath]
someCardinalities = Array.new()
someFilesToSubNodesNamesToCardinalities.each() do |aFile, someSubNodesNamesToCardinalities|
aCardinality = someSubNodesNamesToCardinalities[aNameFromFullPath]
if aCardinality > 0 && someCardinalities.include?(aCardinality) == false then someCardinalities << aCardinality end
end
if someCardinalities.length == 1
s = s + someCardinalities.first.to_s + " "
else
anAvg = someCardinalities.inject(0) {|sum,value| Float(sum) + Float(value) } / Float(someCardinalities.length)
s = s + sprintf('%.1f', anAvg) + " " + someCardinalities.min.to_s + "..." + someCardinalities.max.to_s + " "
end
s = s + sprintf('%d', Float(nbFilesWithThatFullPath) / Float(nbFilesWithParentFullPath) * 100) + '%'
s = s + "\n"
end
return s
end
def NodeAnalyzer.getNameFromFullPath(aFullPath)
if aFullPath.include?("/") == false then return aFullPath end
aNameFromFullPath = aFullPath.dup
aNameFromFullPath[/^(?:[^\/]+\/)+/] = ""
return aNameFromFullPath
end
def NodeAnalyzer.getIndentedNameFromFullPath(aFullPath)
if aFullPath.include?("/") == false then return aFullPath end
anIndentedNameFromFullPath = aFullPath.dup
anIndentedNameFromFullPath = anIndentedNameFromFullPath.gsub(/[^\/]+\//, " ")
return anIndentedNameFromFullPath
end
def NodeAnalyzer.getParentFullPath(aFullPath)
if aFullPath.include?("/") == false then return "" end
aParentFullPath = aFullPath.dup
aParentFullPath[/\/[^\/]+$/] = ""
return aParentFullPath
end
def NodeAnalyzer.getNbFilesWithThatFullPath(aFullPath)
if aFullPath.empty?
return @@nodesRootAnalyzer.subNodesAnalyzers.length
else
return @@fullPathsToFiles[aFullPath].length;
end
end
end
class REXML::Document
def analyze(node, aFatherNodeAnalyzer, aFile="")
anNodeAnalyzer = NodeAnalyzer.new(node.name, aFatherNodeAnalyzer, aFile)
node.elements.each() do |aSubNode| analyze(aSubNode, anNodeAnalyzer) end
NodeAnalyzer.recordNode(anNodeAnalyzer)
end
end
begin
anXmlFilesDirectory = "xmlfiles.com/examples/"
anXmlFilesRegExp = Regexp.new("http:\/\/" + anXmlFilesDirectory + "([^\"]*)")
a = Net::HTTP.get(URI("http://www.google.fr/search?q=site:"+anXmlFilesDirectory+"+filetype:xml&num=100&as_qdr=all&filter=0"))
someXmlFiles = a.scan(anXmlFilesRegExp)
someXmlFiles.each() do |anXmlFile|
anXmlFileContent = Net::HTTP.get(URI("http://" + anXmlFilesDirectory + anXmlFile.to_s))
anUTF8XmlFileContent = Iconv.conv("ISO-8859-1//ignore", 'UTF-8', anXmlFileContent).gsub(/\s+encoding\s*=\s*\"[^\"]+\"\s*\?/,"?")
anXmlDocument = Document.new(anUTF8XmlFileContent)
puts "Analyzing #{anXmlFile}: #{NodeAnalyzer.nodesRootAnalyzer.name}"
anXmlDocument.analyze(anXmlDocument.root,NodeAnalyzer.nodesRootAnalyzer, anXmlFile.to_s)
end
NodeAnalyzer.recordNode(NodeAnalyzer.nodesRootAnalyzer)
puts NodeAnalyzer.displayStats
end
Upvotes: 0
Reputation: 3690
Here's an XSLT 2.0 method.
Assuming that $docs contains a sequence of document nodes that you want to scan, you want to create one line for each element that appears in the documents. You can use <xsl:for-each-group> to do that:
<xsl:for-each-group select="$docs//*" group-by="name()">
  <xsl:sort select="current-grouping-key()" />
  <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
  <xsl:value-of select="$name" />
  ...
</xsl:for-each-group>
Then you want to find out the stats for that element amongst the documents. First, find the documents that have an element of that name in them:
<xsl:variable name="docs-with" as="document-node()+"
  select="$docs[//*[name() = $name]]" />
Second, you need a sequence of the number of elements of that name in each of the documents:
<xsl:variable name="elem-counts" as="xs:integer+"
  select="$docs-with/count(//*[name() = $name])" />
And now you can do the calculations. Average, minimum and maximum can be calculated with the avg(), min() and max() functions. The percentage is simply the number of documents that contain the element divided by the total number of documents, formatted.
Putting that together:
<xsl:for-each-group select="$docs//*" group-by="name()">
  <xsl:sort select="current-grouping-key()" />
  <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
  <xsl:variable name="docs-with" as="document-node()+"
    select="$docs[//*[name() = $name]]" />
  <xsl:variable name="elem-counts" as="xs:integer+"
    select="$docs-with/count(//*[name() = $name])" />
  <xsl:value-of select="$name" />
  <xsl:text>* </xsl:text>
  <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
  <xsl:text>%</xsl:text>
  <xsl:text>&#10;</xsl:text>
</xsl:for-each-group>
What I haven't done here is indent the lines according to the depth of the element. I've just ordered the elements alphabetically to give you statistics. Two reasons for that: first, it's significantly harder (too involved to write here) to display the element statistics in a structure that reflects how they appear in the documents, not least because different documents may have different structures. Second, in many markup languages, the precise structure of the documents can't be known in advance (because, for example, sections can nest within sections to any depth).
I hope it's useful nonetheless.
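For the indented, structure-aware view that the stylesheet leaves out, here is a rough sketch in Ruby rather than XSLT. It keys every element by its full path from the root (so b/c and d/c are counted separately) and indents by depth. The method names count_paths and structure_report are hypothetical, the documents are passed in as strings, and percentages are taken over all documents. It does not solve the recursion problem mentioned above: deeply nested sections simply produce longer and longer paths.

```ruby
require "rexml/document"

# Recursively count elements by full path, e.g. "root/b/c".
def count_paths(element, prefix, counts)
  path = prefix.empty? ? element.name : "#{prefix}/#{element.name}"
  counts[path] += 1
  element.elements.each { |child| count_paths(child, path, counts) }
  counts
end

# Return one indented report line per path: name*avg min max percent%.
def structure_report(xml_strings)
  per_doc = xml_strings.map do |xml|
    count_paths(REXML::Document.new(xml).root, "", Hash.new(0))
  end
  # Sort by path segments so that children follow their parent.
  paths = per_doc.flat_map(&:keys).uniq.sort_by { |path| path.split("/") }
  paths.map do |path|
    # Keep only the documents that actually contain this path.
    counts = per_doc.map { |c| c.fetch(path, 0) }.reject(&:zero?)
    segments = path.split("/")
    avg = counts.sum.to_f / counts.length
    percent = (counts.length * 100.0 / per_doc.length).round
    format("%s%s*%.1f %d %d %d%%", "  " * (segments.length - 1),
           segments.last, avg, counts.min, counts.max, percent)
  end
end

docs = ["<root><a/><b><c/><c/></b></root>", "<root><b><c/></b></root>"]
puts structure_report(docs)
```

Sorting on the split path rather than on the raw string keeps children directly under their parent even when sibling names contain characters such as "-" that sort before "/".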
UPDATE:
Need the XSLT wrapper and some instructions for running XSLT? OK. First, get your hands on Saxon 9B.
You'll need to put all the files you want to analyse in a directory. Saxon lets you access all the files in that directory (or its subdirectories) as a collection, using a special URI syntax. It's worth having a look at that syntax if you want to search recursively or filter the files that you're looking at by their filename.
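Outside Saxon, the same kind of file selection can be sketched in Ruby with Dir.glob. This only illustrates the idea and is not Saxon's behaviour; the method name xml_files and its recursive flag are hypothetical:

```ruby
# List the XML files in a directory, optionally descending into subdirectories.
def xml_files(dir, recursive: false)
  pattern = recursive ? File.join(dir, "**", "*.xml") : File.join(dir, "*.xml")
  Dir.glob(pattern).sort
end

# Example call; the path is a placeholder.
puts xml_files("path/to/your/directory", recursive: true)
```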
Now the full XSLT:
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">
  <xsl:param name="dir" as="xs:string"
    select="'file:///path/to/default/directory?select=*.xml'" />
  <xsl:output method="text" />
  <xsl:variable name="docs" as="document-node()*"
    select="collection($dir)" />
  <xsl:template name="main">
    <xsl:for-each-group select="$docs//*" group-by="name()">
      <xsl:sort select="current-grouping-key()" />
      <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
      <xsl:variable name="docs-with" as="document-node()+"
        select="$docs[//*[name() = $name]]" />
      <xsl:variable name="elem-counts" as="xs:integer+"
        select="$docs-with/count(//*[name() = $name])" />
      <xsl:value-of select="$name" />
      <xsl:text>* </xsl:text>
      <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
      <xsl:text>%</xsl:text>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
And to run it you would do something like:
> java -jar path/to/saxon.jar -it:main -o:report.txt dir=file:///path/to/your/directory?select=*.xml
This tells Saxon to start the process with the template named main, to set the dir parameter to file:///path/to/your/directory?select=*.xml and send the output to report.txt.
Upvotes: 11
Reputation: 10046
Go with JeniT's answer - she's one of the first XSLT gurus I started learning from back in '02. To really appreciate the power of XML you should work with XPath and XSLT and learn to manipulate the nodes.
Upvotes: 0
Reputation: 1327004
[community post, here: no karma involved;) ]
I propose a code-challenge here:
parse all the XML files found in xmlfiles.com/examples and try to come up with the following output:
Analyzing plant_catalog.xml:
Analyzing note.xml:
Analyzing portfolio.xml:
Analyzing note_ex_dtd.xml:
Analyzing home.xml:
Analyzing simple.xml:
Analyzing cd_catalog.xml:
Analyzing portfolio_xsl.xml:
Analyzing note_in_dtd.xml:
Statistical Elements Analysis of 9 xml documents with 34 elements
CATALOG*2 22%
  CD*26 50%
    ARTIST*26 100%
    COMPANY*26 100%
    COUNTRY*26 100%
    PRICE*26 100%
    TITLE*26 100%
    YEAR*26 100%
  PLANT*36 50%
    AVAILABILITY*36 100%
    BOTANICAL*36 100%
    COMMON*36 100%
    LIGHT*36 100%
    PRICE*36 100%
    ZONE*36 100%
breakfast-menu*1 11%
  food*5 100%
    calories*5 100%
    description*5 100%
    name*5 100%
    price*5 100%
note*3 33%
  body*1 100%
  from*1 100%
  heading*1 100%
  to*1 100%
page*1 11%
  para*1 100%
  title*1 100%
portfolio*2 22%
  stock*2 100%
    name*2 100%
    price*2 100%
    symbol*2 100%