Tayyab Anwar

Reputation: 319

ADLA XMLExtractor can't read properties?

I have been using the sample XMLExtractor (cloned from https://github.com/Azure/usql/tree/master/Examples/DataFormats) to extract an attribute from an XML element.

The extractor fails to work if the root element has any attributes defined.

For example, I need to get the "sTime" attribute of the "rec" element from the following XML file:

<lics xmlns="***" lVer="*" pID="*" aKey="*" cTime="*" gDel="*" country="*" fStr="*">
   <rec Ver="*" hID="*.*.*" cSID="Y5/*=" uID="*\Rad.*" uSID="*/*=" cAttrs="*" sTime="*" eTime="*" projID="*" docID="*" imsID="*">
   </rec>
</lics>

with the following U-SQL script:

@e = EXTRACT a string, b string
 FROM @"D:\file.xml"
 USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(rowPath:"rec",
                         columnPaths:new SQL.MAP<string, string> { {"@sTime", "a"} });

OUTPUT @e TO "D:/output.csv" USING Outputters.Csv(quoting:false);

This writes an empty file. But if I remove the attributes of the "lics" element, it works.

<lics>
   <rec Ver="*" hID="*.*.*" cSID="Y5/*=" uID="*\Rad.*" uSID="*/*=" cAttrs="*" sTime="*" eTime="*" projID="*" docID="*" imsID="*">
   </rec>
</lics>

Is this a problem with the extractor? Or does this need to be configured through one of the extractor's parameters?

Upvotes: 2

Views: 425

Answers (2)

Michael Rys

Reputation: 6684

I would probably use another SQL.MAP to define the prefix to namespace mapping (and not require the same prefix as in the document).

I created a feature request here: https://feedback.azure.com/forums/327234-data-lake/suggestions/11675604-add-xml-namespace-support-to-xml-extractor. Please add your vote to it.

UPDATE: The XmlDomExtractor now supports XML Namespaces. Use the following USING clause:

 USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(rowPath:"ns:rec",
                     columnPaths:new SQL.MAP<string, string> { {"@sTime", "a"} },
                     namespaceDecls: new SqlMap<string,string>{{"ns","***"}});
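
As a rough illustration of the prefix-to-namespace mapping idea (not the sample extractor's actual code), the sketch below uses a plain Dictionary as a stand-in for the SqlMap and a made-up URI, urn:example:lics, in place of the masked namespace. NamespaceMapDemo, BuildManager and CountRows are hypothetical names. It suggests that the prefix used in the XPath only has to map to the same URI; it does not have to match the prefix used in the document:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class NamespaceMapDemo
{
    // Hypothetical helper: registers every prefix -> URI pair on a namespace manager.
    public static XmlNamespaceManager BuildManager(XmlDocument doc, IDictionary<string, string> decls)
    {
        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        foreach (var decl in decls)
        {
            nsmgr.AddNamespace(decl.Key, decl.Value);
        }
        return nsmgr;
    }

    // Counts the nodes selected by rowPath, relative to the document element.
    public static int CountRows(string xml, string rowPath, IDictionary<string, string> decls)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.SelectNodes(rowPath, BuildManager(doc, decls)).Count;
    }

    public static void Main()
    {
        // The document binds the (assumed) URI to the prefix "x"; the query uses "ns".
        string xml = "<x:lics xmlns:x='urn:example:lics'><x:rec sTime='12:00'/></x:lics>";
        var decls = new Dictionary<string, string> { { "ns", "urn:example:lics" } };
        Console.WriteLine(CountRows(xml, "ns:rec", decls)); // prints 1
    }
}
```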

Upvotes: 1

Tomalak

Reputation: 338326

The problem is that the Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor completely ignores XML namespaces.

A better implementation would look like this (untested, though):

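using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class XmlDomExtractorNs : IExtractor
{
    private string rowPath;
    private SqlMap<string, string> columnPaths;
    private string namespaces;
    // Matches prefix=uri pairs, with an optional "xmlns:" and optional quotes around the URI.
    private Regex xmlns = new Regex("(?:xmlns:)?(\\S+)\\s*=\\s*([\"']?)(\\S+)\\2");

    public XmlDomExtractorNs(string rowPath, SqlMap<string, string> columnPaths, string namespaces)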
    {
        this.rowPath = rowPath;
        this.columnPaths = columnPaths;
        this.namespaces = namespaces;
    }

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        IColumn column = output.Schema.FirstOrDefault(col => col.Type != typeof(string));
        if (column != null)
        {
            throw new ArgumentException(string.Format("Column '{0}' must be of type 'string', not '{1}'", column.Name, column.Type.Name));
        }

        XmlDocument xmlDocument = new XmlDocument();
        xmlDocument.Load(input.BaseStream);

        XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDocument.NameTable);
        if (this.namespaces != null)
        {
            foreach (Match nsdef in xmlns.Matches(this.namespaces))
            {
                string prefix = nsdef.Groups[1].Value;
                string uri = nsdef.Groups[3].Value;
                nsmgr.AddNamespace(prefix, uri);
            }
        }

        foreach (XmlNode xmlNode in xmlDocument.DocumentElement.SelectNodes(this.rowPath, nsmgr))
        {
            foreach(IColumn col in output.Schema)
            {
                var explicitColumnMapping = this.columnPaths.FirstOrDefault(columnPath => columnPath.Value == col.Name);
                XmlNode xml = xmlNode.SelectSingleNode(explicitColumnMapping.Key ?? col.Name, nsmgr);
                output.Set(explicitColumnMapping.Value ?? col.Name, xml == null ? null : xml.InnerXml);
            }
            yield return output.AsReadOnly();
        }
    }
}

and used like this:

@e = EXTRACT a string, b string
  FROM @"D:\file.xml"
  USING new Your.Namespace.XmlDomExtractorNs(
    rowPath:"lics:rec",
    columnPaths:new SQL.MAP<string, string> { {"@sTime", "a"} },
    namespaces:"lics=http://the/namespace/of/the/doc"
  );

OUTPUT @e TO "D:/output.csv" USING Outputters.Csv(quoting:false);

The namespaces argument will be parsed into namespace-prefix and namespace-uri parts which will then be used to drive the XPath queries. For convenience it supports any of these value formats:

  • 'xmlns:foo="http://uri/1" xmlns:bar="http://uri/2"'
  • "xmlns:foo='http://uri/1' xmlns:bar='http://uri/2'"
  • "xmlns:foo=http://uri/1 xmlns:bar=http://uri/2"
  • "foo=http://uri/1 bar=http://uri/2"

so it accommodates copying them directly from the XML source as well as creating them manually without too much fuss.
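
To see how the four formats above are handled, here is a small standalone sketch that runs the same regex outside of U-SQL. The class name NamespaceArgDemo and the Parse helper are hypothetical and exist only for this illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamespaceArgDemo
{
    // Same pattern as in the extractor above.
    private static readonly Regex Xmlns =
        new Regex("(?:xmlns:)?(\\S+)\\s*=\\s*([\"']?)(\\S+)\\2");

    // Returns the (prefix, uri) pairs found in a namespaces argument.
    public static List<KeyValuePair<string, string>> Parse(string namespaces)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Xmlns.Matches(namespaces))
        {
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[3].Value));
        }
        return result;
    }

    public static void Main()
    {
        string[] inputs =
        {
            "xmlns:foo=\"http://uri/1\" xmlns:bar=\"http://uri/2\"",
            "xmlns:foo='http://uri/1' xmlns:bar='http://uri/2'",
            "xmlns:foo=http://uri/1 xmlns:bar=http://uri/2",
            "foo=http://uri/1 bar=http://uri/2"
        };
        foreach (string input in inputs)
        {
            // Each input should yield foo -> http://uri/1 and bar -> http://uri/2.
            foreach (var pair in Parse(input))
            {
                Console.WriteLine(pair.Key + " -> " + pair.Value);
            }
        }
    }
}
```

One caveat worth keeping in mind: the prefix group is greedy, so a URI that itself contains "=" (a query string, for instance) would be split at its last "=" rather than its first. Namespace URIs rarely look like that, but a stricter pattern may be preferable if yours do.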

Since the XML document you use has a default namespace and XPath mandates the use of prefixes for any namespace you need in the expression, you must choose a namespace prefix for your namespace URI. I chose to use lics above.
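
The effect of the default namespace can be reproduced with plain XmlDocument calls. This is a minimal sketch, assuming urn:example:lics as a placeholder for the masked namespace URI and 12:00 as a placeholder attribute value; DefaultNamespaceDemo and its methods are hypothetical names:

```csharp
using System;
using System.Xml;

public static class DefaultNamespaceDemo
{
    // Assumed placeholder for the masked namespace URI in the question.
    public const string Uri = "urn:example:lics";

    public static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<lics xmlns='" + Uri + "'><rec sTime='12:00'/></lics>");
        return doc;
    }

    // An unprefixed name in XPath 1.0 only matches elements in no namespace.
    public static int CountUnprefixed()
    {
        return Load().DocumentElement.SelectNodes("rec").Count;
    }

    // With a prefix bound to the document's default namespace, the element is found.
    public static string ReadSTime()
    {
        XmlDocument doc = Load();
        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("lics", Uri);
        XmlNode rec = doc.DocumentElement.SelectSingleNode("lics:rec", nsmgr);
        // Unprefixed attributes are not in the default namespace, so "@sTime" needs no prefix.
        XmlNode attr = rec.SelectSingleNode("@sTime", nsmgr);
        return attr.Value;
    }

    public static void Main()
    {
        Console.WriteLine(CountUnprefixed()); // prints 0
        Console.WriteLine(ReadSTime());       // prints 12:00
    }
}
```

This also explains why the column path stays "@sTime" while only the row path gains a prefix.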


FWIW, the regex that parses the namespaces argument breaks down as follows:

(?:            # non-capturing group
  xmlns:       #   literal "xmlns:"
)?             # end non-capturing group, make optional
(\S+)          # GROUP 1 (prefix): one or more non-whitespace characters
\s*=\s*        # a literal "=" optionally surrounded by whitespace
(["']?)        # GROUP 2 (delimiter): either single or double quote, optional
(\S+)          # GROUP 3 (uri): one or more non-whitespace characters
\2             # whatever was in group 2 to end the namespace URI

Upvotes: 3
