Reputation: 287690

Program to analyze a lot of XMLs

I have a lot of XML files and I'd like to generate a report from them. The report should provide information such as:

root 100%
 a*1 90%
 b*1 80%
  c*5 40%

meaning that all documents have a root element, 90% have one a element in the root, 80% have one b element in the root, 40% have 5 c elements in b.

If for example some documents have 4 c elements, some 5 and some 6, it should say something like:

c*4.3 4 6 40%

meaning that 40% have between 4 and 6 c elements there, and the average is 4.3.
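
The arithmetic behind one report line can be sketched roughly as below. This is only an illustration: `format_line` and its parameters are made-up names, and the denominator `total_docs` is an assumption, since the percentages above could be measured against all documents or only against those containing the parent element.

```python
def format_line(name, counts, total_docs):
    """Format one report line.

    counts: occurrences of `name` in each document that contains it.
    total_docs: number of documents the percentage is measured against
                (assumption: the caller decides what that base is).
    """
    percent = 100 * len(counts) / total_docs
    low, high = min(counts), max(counts)
    if low == high:
        # Every document agrees on the count, so print it alone.
        return f"{name}*{low} {percent:.0f}%"
    average = sum(counts) / len(counts)
    return f"{name}*{average:.1f} {low} {high} {percent:.0f}%"


print(format_line("c", [5, 5], 5))               # c*5 40%
print(format_line("c", [4, 4, 4, 4, 4, 6], 15))  # c*4.3 4 6 40%
```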

I am looking for free software; if it doesn't exist, I'll write it. I was about to start, but thought I should check first. I may not be the first person who has had to analyze thousands of XML files and get a structural overview of them.

Upvotes: 3

Answers (6)

Mads Hansen

Reputation: 66783

Check out Gadget

(source: mit.edu)

Upvotes: 3

VonC

Reputation: 1327004

Here is a possible solution in Ruby to this code challenge...
Since it is my very first Ruby program, I am sure it is quite terribly coded, but at least it may answer J. Pablo Fernandez's question.

Copy-paste it into a .rb file and run ruby on it. If you have an Internet connection, it will work ;)

require "rexml/document"
require "net/http"
require "iconv" # removed from the standard library in Ruby 2.0; newer versions need the iconv gem
include REXML
class NodeAnalyzer
  @@fullPathToFilesToSubNodesNamesToCardinalities = Hash.new()
  @@fullPathsToFiles = Hash.new() #list of files in which a fullPath node is detected
  @@fullPaths = Array.new # all fullpaths sorted alphabetically
  attr_reader :name, :father, :subNodesAnalyzers, :indent, :file, :subNodesNamesToCardinalities
  def initialize(aName="", aFather=nil, aFile="")
    @name = aName; @father = aFather; @subNodesAnalyzers = []; @file = aFile
    @subNodesNamesToCardinalities = Hash.new(0)
    if aFather && !aFather.name.empty? then @indent = "  " else @indent = "" end
    if aFather
      @indent = @father.indent + self.indent
      @father.subNodesAnalyzers << self
      @father.updateSubNodesNamesToCardinalities(@name)
    end
  end
  @@nodesRootAnalyzer = NodeAnalyzer.new
  def NodeAnalyzer.nodesRootAnalyzer
    return @@nodesRootAnalyzer
  end
  def updateSubNodesNamesToCardinalities(aSubNodeName)
    aSubNodeCardinality = @subNodesNamesToCardinalities[aSubNodeName]
    @subNodesNamesToCardinalities[aSubNodeName] = aSubNodeCardinality + 1
  end
  def NodeAnalyzer.recordNode(aNodeAnalyzer)
    if aNodeAnalyzer.fullNodePath.empty? == false
      if @@fullPaths.include?(aNodeAnalyzer.fullNodePath) == false then @@fullPaths << aNodeAnalyzer.fullNodePath end
      # record a full path against its XML file (only once for a given XML file)
      someFiles = @@fullPathsToFiles[aNodeAnalyzer.fullNodePath]
      if someFiles == nil 
        someFiles = Array.new(); @@fullPathsToFiles[aNodeAnalyzer.fullNodePath] = someFiles; 
      end
      if !someFiles.include?(aNodeAnalyzer.file) then someFiles << aNodeAnalyzer.file end
    end
    # record cardinalities of sub nodes for a given XML file
    someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath]
    if someFilesToSubNodesNamesToCardinalities == nil 
      someFilesToSubNodesNamesToCardinalities = Hash.new(); @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath] = someFilesToSubNodesNamesToCardinalities ; 
    end
    someSubNodesNamesToCardinalities = someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file]
    if someSubNodesNamesToCardinalities == nil
      someSubNodesNamesToCardinalities = Hash.new(0); someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file] = someSubNodesNamesToCardinalities
      someSubNodesNamesToCardinalities.update(aNodeAnalyzer.subNodesNamesToCardinalities)
    else
      aNodeAnalyzer.subNodesNamesToCardinalities.each() do |aSubNodeName, aCardinality|
        someSubNodesNamesToCardinalities[aSubNodeName] = someSubNodesNamesToCardinalities[aSubNodeName] + aCardinality
      end
    end  
    #puts "someSubNodesNamesToCardinalities for #{aNodeAnalyzer.fullNodePath}: #{someSubNodesNamesToCardinalities}"
  end
  def file
    #if @file.empty? then @father.file else return @file end
    if @file.empty? then if @father != nil then return @father.file else return '' end else return @file end
  end
  def fullNodePath
    if @father == nil then return '' elsif @father.name.empty? then return @name else return @father.fullNodePath+"/"+@name end
  end
  def to_s
    s = ""
    if @name.empty? == false
      s = "#{@indent}#{self.fullNodePath} [#{self.file}]\n"
    end
    @subNodesAnalyzers.each() do |aSubNodeAnalyzer|
      s = s + aSubNodeAnalyzer.to_s
    end
    return s
  end
  def NodeAnalyzer.displayStats(aFullPath="")
    s = "";
    if aFullPath.empty? then s = "Statistical Elements Analysis of #{@@nodesRootAnalyzer.subNodesAnalyzers.length} xml documents with #{@@fullPaths.length} elements\n" end
    someFullPaths = @@fullPaths.sort
    someFullPaths.each do |aFullPath|
      s = s + getIndentedNameFromFullPath(aFullPath) + "*"
      nbFilesWithThatFullPath = getNbFilesWithThatFullPath(aFullPath);
      aParentFullPath = getParentFullPath(aFullPath)
      nbFilesWithParentFullPath = getNbFilesWithThatFullPath(aParentFullPath);
      aNameFromFullPath = getNameFromFullPath(aFullPath)
      someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aParentFullPath]
      someCardinalities = Array.new()
      someFilesToSubNodesNamesToCardinalities.each() do |aFile, someSubNodesNamesToCardinalities|
        aCardinality = someSubNodesNamesToCardinalities[aNameFromFullPath]
        if aCardinality > 0 && someCardinalities.include?(aCardinality) == false then someCardinalities << aCardinality end
      end
      if someCardinalities.length == 1
        s = s + someCardinalities.to_s + " "
      else
        anAvg = someCardinalities.inject(0) {|sum,value| Float(sum) + Float(value) } / Float(someCardinalities.length)
        s = s + sprintf('%.1f', anAvg) + " " + someCardinalities.min.to_s + "..." + someCardinalities.max.to_s + " "
      end
      s = s + sprintf('%d', Float(nbFilesWithThatFullPath) / Float(nbFilesWithParentFullPath) * 100) + '%'
      s = s + "\n"
    end
    return s
  end
  def NodeAnalyzer.getNameFromFullPath(aFullPath)
    if aFullPath.include?("/") == false then return aFullPath end
    aNameFromFullPath = aFullPath.dup
    aNameFromFullPath[/^(?:[^\/]+\/)+/] = ""
    return aNameFromFullPath
  end
  def NodeAnalyzer.getIndentedNameFromFullPath(aFullPath)
    if aFullPath.include?("/") == false then return aFullPath end
    anIndentedNameFromFullPath = aFullPath.dup
    anIndentedNameFromFullPath = anIndentedNameFromFullPath.gsub(/[^\/]+\//, "  ")
    return anIndentedNameFromFullPath
  end
  def NodeAnalyzer.getParentFullPath(aFullPath)
    if aFullPath.include?("/") == false then return "" end
    aParentFullPath = aFullPath.dup
    aParentFullPath[/\/[^\/]+$/] = ""
    return aParentFullPath
  end
  def NodeAnalyzer.getNbFilesWithThatFullPath(aFullPath)
    if aFullPath.empty? 
      return @@nodesRootAnalyzer.subNodesAnalyzers.length
    else
      return @@fullPathsToFiles[aFullPath].length;
    end
  end
end
class REXML::Document
  def analyze(node, aFatherNodeAnalyzer, aFile="")
    anNodeAnalyzer = NodeAnalyzer.new(node.name, aFatherNodeAnalyzer, aFile)
    node.elements.each() do |aSubNode| analyze(aSubNode, anNodeAnalyzer) end
    NodeAnalyzer.recordNode(anNodeAnalyzer)
  end
end

begin
  anXmlFilesDirectory = "xmlfiles.com/examples/"
  anXmlFilesRegExp = Regexp.new("http:\/\/" + anXmlFilesDirectory + "([^\"]*)")
  a = Net::HTTP.get(URI("http://www.google.fr/search?q=site:"+anXmlFilesDirectory+"+filetype:xml&num=100&as_qdr=all&filter=0"))
  someXmlFiles = a.scan(anXmlFilesRegExp)
  someXmlFiles.each() do |anXmlFile|
    anXmlFileContent = Net::HTTP.get(URI("http://" + anXmlFilesDirectory + anXmlFile.to_s))
    anUTF8XmlFileContent = Iconv.conv("ISO-8859-1//ignore", 'UTF-8', anXmlFileContent).gsub(/\s+encoding\s*=\s*\"[^\"]+\"\s*\?/,"?")
    anXmlDocument = Document.new(anUTF8XmlFileContent)
    puts "Analyzing #{anXmlFile}: #{NodeAnalyzer.nodesRootAnalyzer.name}"
    anXmlDocument.analyze(anXmlDocument.root,NodeAnalyzer.nodesRootAnalyzer, anXmlFile.to_s)
  end
  NodeAnalyzer.recordNode(NodeAnalyzer.nodesRootAnalyzer)
  puts NodeAnalyzer.displayStats
end

Upvotes: 0

JeniT

Reputation: 3690

Here's an XSLT 2.0 method.

Assuming that $docs contains a sequence of document nodes that you want to scan, you want to create one line for each element that appears in the documents. You can use <xsl:for-each-group> to do that:

<xsl:for-each-group select="$docs//*" group-by="name()">
  <xsl:sort select="current-grouping-key()" />
  <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
  <xsl:value-of select="$name" />
  ...
</xsl:for-each-group>

Then you want to find out the stats for that element amongst the documents. First, find the documents that have an element of that name in them:

<xsl:variable name="docs-with" as="document-node()+"
  select="$docs[//*[name() = $name]]" />

Second, you need a sequence of the number of elements of that name in each of the documents:

<xsl:variable name="elem-counts" as="xs:integer+"
  select="$docs-with/count(//*[name() = $name])" />

And now you can do the calculations. Average, minimum and maximum can be calculated with the avg(), min() and max() functions. The percentage is simply the number of documents that contain the element divided by the total number of documents, formatted.
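
For readers who don't have an XSLT 2.0 processor to hand, the same per-name calculation can be sketched in Python with the standard library. This is a rough equivalent, not a translation of the stylesheet: `element_stats` is a made-up name, the documents are passed in as strings, and namespaced elements would show up as `{uri}local` rather than as the prefixed names that `name()` returns.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def element_stats(xml_strings):
    """Return {name: (average, minimum, maximum, percent)} across documents."""
    per_name = defaultdict(list)  # name -> one count per document containing it
    for text in xml_strings:
        counts = defaultdict(int)
        # iter() visits the root element and all of its descendants.
        for element in ET.fromstring(text).iter():
            counts[element.tag] += 1
        for name, count in counts.items():
            per_name[name].append(count)
    total = len(xml_strings)
    return {
        name: (sum(c) / len(c), min(c), max(c), 100 * len(c) / total)
        for name, c in per_name.items()
    }


docs = ["<r><a/><a/></r>", "<r><a/><b/></r>"]
for name, (avg, low, high, pct) in sorted(element_stats(docs).items()):
    print(f"{name}* {avg:.1f} {low} {high} {pct:.0f}%")
```

The list kept per name plays the role of `$elem-counts`, and its length plays the role of `count($docs-with)`.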

Putting that together:

<xsl:for-each-group select="$docs//*" group-by="name()">
  <xsl:sort select="current-grouping-key()" />
  <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
  <xsl:variable name="docs-with" as="document-node()+"
    select="$docs[//*[name() = $name]]" />
  <xsl:variable name="elem-counts" as="xs:integer+"
    select="$docs-with/count(//*[name() = $name])" />
  <xsl:value-of select="$name" />
  <xsl:text>* </xsl:text>
  <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
  <xsl:text>%</xsl:text>
  <xsl:text>&#xA;</xsl:text>
</xsl:for-each-group>

What I haven't done here is indent the lines according to the depth of the element. I've just ordered the elements alphabetically to give you statistics. Two reasons for that: first, it's significantly harder (too involved to write here) to display the element statistics in a structure that reflects how they appear in the documents, not least because different documents may have different structures. Second, in many markup languages, the precise structure of the documents can't be known in advance (because, for example, sections can nest within sections to any depth).

I hope it's useful nonetheless.

UPDATE:

Need the XSLT wrapper and some instructions for running XSLT? OK. First, get your hands on Saxon 9B.

You'll need to put all the files you want to analyse in a directory. Saxon lets you access all the files in that directory (or its subdirectories) as a collection, using a special URI syntax. It's worth having a look at that syntax if you want to search recursively or filter the files you're looking at by their filename.

Now the full XSLT:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

<xsl:param name="dir" as="xs:string"
  select="'file:///path/to/default/directory?select=*.xml'" />

<xsl:output method="text" />

<xsl:variable name="docs" as="document-node()*"
  select="collection($dir)" />

<xsl:template name="main">
  <xsl:for-each-group select="$docs//*" group-by="name()">
    <xsl:sort select="current-grouping-key()" />
    <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
    <xsl:variable name="docs-with" as="document-node()+"
      select="$docs[//*[name() = $name]]" />
    <xsl:variable name="elem-counts" as="xs:integer+"
      select="$docs-with/count(//*[name() = $name])" />
    <xsl:value-of select="$name" />
    <xsl:text>* </xsl:text>
    <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
    <xsl:text>%</xsl:text>
    <xsl:text>&#xA;</xsl:text>
  </xsl:for-each-group>
</xsl:template> 

</xsl:stylesheet>

And to run it you would do something like:

> java -jar path/to/saxon.jar -it:main -o:report.txt "dir=file:///path/to/your/directory?select=*.xml"

This tells Saxon to start the process with the template named main, to set the dir parameter to file:///path/to/your/directory?select=*.xml and send the output to report.txt.

Upvotes: 11

David Robbins

Reputation: 10046

Go with JeniT's answer - she's one of the first XSLT gurus I started learning from back in '02. To really appreciate the power of XML you should work with XPath and XSLT and learn to manipulate the nodes.

Upvotes: 0

VonC

Reputation: 1327004

[community post, here: no karma involved;) ]
I propose a code-challenge here:

parse all the XML files found in xmlfiles.com/examples and try to come up with the following output:

Analyzing plant_catalog.xml: 
Analyzing note.xml: 
Analyzing portfolio.xml: 
Analyzing note_ex_dtd.xml: 
Analyzing home.xml: 
Analyzing simple.xml: 
Analyzing cd_catalog.xml: 
Analyzing portfolio_xsl.xml: 
Analyzing note_in_dtd.xml: 
Statistical Elements Analysis of 9 xml documents with 34 elements
CATALOG*2 22%
  CD*26 50%
    ARTIST*26 100%
    COMPANY*26 100%
    COUNTRY*26 100%
    PRICE*26 100%
    TITLE*26 100%
    YEAR*26 100%
  PLANT*36 50%
    AVAILABILITY*36 100%
    BOTANICAL*36 100%
    COMMON*36 100%
    LIGHT*36 100%
    PRICE*36 100%
    ZONE*36 100%
breakfast-menu*1 11%
  food*5 100%
    calories*5 100%
    description*5 100%
    name*5 100%
    price*5 100%
note*3 33%
  body*1 100%
  from*1 100%
  heading*1 100%
  to*1 100%
page*1 11%
  para*1 100%
  title*1 100%
portfolio*2 22%
  stock*2 100%
    name*2 100%
    price*2 100%
    symbol*2 100%

Upvotes: 0

jeremy

Reputation: 4681

Beautiful Soup makes parsing XML trivial in python.
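
As a rough idea of what the Python route could look like, here is a minimal sketch. Note that it uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup, and that `path_counts` and `report` are hypothetical names. It assumes counts are totalled per document for each element path, and that each percentage is relative to the documents containing the parent path.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def path_counts(root):
    """Count the elements found at each slash-separated path in one document."""
    counts = Counter()

    def walk(element, path):
        counts[path] += 1
        for child in element:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return counts


def report(xml_strings):
    """Build the indented report for a list of XML documents given as strings."""
    per_path = defaultdict(list)  # path -> one count per document containing it
    for text in xml_strings:
        for path, count in path_counts(ET.fromstring(text)).items():
            per_path[path].append(count)
    lines = []
    # Sorting on the split path keeps children directly under their parent.
    for path in sorted(per_path, key=lambda p: p.split("/")):
        counts = per_path[path]
        parent, _, name = path.rpartition("/")
        # Top-level elements are measured against all documents.
        base = len(per_path[parent]) if parent else len(xml_strings)
        low, high = min(counts), max(counts)
        if low == high:
            amount = str(low)
        else:
            amount = f"{sum(counts) / len(counts):.1f} {low} {high}"
        indent = " " * path.count("/")
        lines.append(f"{indent}{name}*{amount} {100 * len(counts) / base:.0f}%")
    return "\n".join(lines)


docs = ["<root><a/><b><c/><c/></b></root>", "<root><a/></root>"]
print(report(docs))
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring(text)`.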

Upvotes: 1
