How can I use the AWS Glue XML classifier?

Question

I am trying to use an AWS Glue classifier to discover the schema for a set of XML files. Each file is stored in an S3 bucket like so:

s3://bucket/name_of_dataset/dataset.xml

There is only one XML file per dataset, so there is no partitioning. I routinely pull these into Spark using spark-xml by simply specifying the rowtag. However, when I try to do something similar in AWS Glue with an XML classifier, the dataset ends up in the Glue Catalog with the classification "unknown". One dataset does show up (each XML dataset has a different schema), but the schema seems to "discover" a nested rowtag rather than the rowtag I specified.

To be more concrete, if I store this file at s3://mybucket/experiment/experiment.xml, what should I specify as the rowtag (which appears to be the only argument)? Is there a better place to go for support?
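For reference, a minimal sketch of how an XML classifier with a row tag can be defined through boto3 rather than the console. The classifier name "experiment-xml" and the row tag "Row" are hypothetical placeholders, not values taken from the file above; the helper function names are also made up for illustration. The row tag is given as a bare element name, without angle brackets.

```python
def build_xml_classifier_request(name, row_tag, classification="xml"):
    """Build the keyword arguments for Glue's create_classifier call.

    name and row_tag are caller-supplied; nothing here is specific
    to the dataset in the question.
    """
    # Glue expects the element name only, e.g. "Row", not "<Row>".
    if not row_tag or row_tag.startswith("<") or row_tag.endswith(">"):
        raise ValueError("row_tag should be a bare element name without angle brackets")
    return {
        "XMLClassifier": {
            "Name": name,
            "Classification": classification,
            "RowTag": row_tag,
        }
    }


def create_xml_classifier(request):
    """Send the request to AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3 installed

    glue = boto3.client("glue")
    return glue.create_classifier(**request)


# Hypothetical values for illustration only.
request = build_xml_classifier_request("experiment-xml", "Row")
print(request)
# create_xml_classifier(request)  # uncomment to actually create the classifier
```

The classifier only takes effect once it is attached to the crawler, for example by listing its name in the crawler's Classifiers setting.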



  
    
      SRX913316
      GSM1627835
    
    GSM1627835: Human_normal_blsatoyst_MethylC-seq_1; Homo sapiens; Bisulfite-Seq
    
      
        SRP064113
        PRJNA296788
      
    
    
      
      
        
...

Thanks in advance.

Sundar · Accepted Answer

We had a similar issue with our XML source and worked through it with AWS technical support. There appears to be a bug in the XML crawler: if the XML contains a value that is empty (in the example you have given, the value for xmlns is ""), the crawler seems to skip the classifier you have defined and defaults to a row tag that is most likely from a nested row in the XML.
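To check whether a file is likely to hit this behaviour, a rough sketch like the one below can list attribute names whose value is an empty string. It is a plain regex scan over the raw text, not a real XML parser, so it may also flag matches inside comments or text content; the function name and the sample string are made up for illustration.

```python
import re

# Matches name="" or name='' (an attribute-like name followed by an empty quoted value).
EMPTY_ATTR = re.compile(r"""([\w:.-]+)\s*=\s*(?:""|'')""")


def find_empty_attributes(xml_text):
    """Return the names of attributes with empty values, in order of appearance."""
    return [match.group(1) for match in EMPTY_ATTR.finditer(xml_text)]


# Hypothetical sample, not the file from the question.
sample = """<a xmlns=""><b id="1" note=''/></a>"""
print(find_empty_attributes(sample))
```

If the scan reports xmlns or other empty values, that matches the condition described above and may explain why the custom row tag is being ignored.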

They are working on a fix, and it is likely to be released this week or next.

Hope this helps.
