Reputation: 2968
I am trying to use an AWS Glue classifier to discover the schema for a set of XML files. I have the files stored in an S3 bucket like so:
s3://bucket/name_of_dataset/dataset.xml
There is only one XML file per dataset, so there is no partitioning. I routinely pull these into Spark using spark-xml by simply specifying the rowtag. However, when I try to do something similar in AWS Glue by using an XML classifier, the datasets end up in the Glue Catalog with an "unknown" classification. One dataset shows up (each XML dataset has a different schema), but its schema seems to have been "discovered" from a nested rowtag and not the rowtag I specified.
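For reference, the kind of classifier setup described above could be sketched with boto3 roughly as follows. This is only an illustration under assumptions: the helper names `xml_classifier_request` and `create_xml_classifier` are hypothetical, and the classifier name and row tag are placeholders supplied by the caller. The only Glue call used is `create_classifier` with an `XMLClassifier` payload.

```python
def xml_classifier_request(name, row_tag):
    """Build the keyword arguments for Glue's create_classifier call.

    `name` and `row_tag` are caller-supplied placeholders (assumptions).
    """
    return {
        "XMLClassifier": {
            "Name": name,
            "Classification": "xml",
            "RowTag": row_tag,
        }
    }


def create_xml_classifier(name, row_tag, region_name=None):
    """Register the classifier with AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue", region_name=region_name)
    return glue.create_classifier(**xml_classifier_request(name, row_tag))
```

The classifier still has to be attached to the crawler that scans the S3 path; that step is not shown here.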
To be more concrete, if I store this file at s3://mybucket/experiment/experiment.xml, what should I specify as the rowtag (which appears to be the only argument)? Is there a better place to go for support?
<?xml version="1.0" encoding="UTF-8"?>
<EXPERIMENT_SET>
<EXPERIMENT xmlns="" alias="GSM1627835" accession="SRX913316" center_name="GEO">
<IDENTIFIERS>
<PRIMARY_ID>SRX913316</PRIMARY_ID>
<SUBMITTER_ID namespace="GEO">GSM1627835</SUBMITTER_ID>
</IDENTIFIERS>
<TITLE>GSM1627835: Human_normal_blsatoyst_MethylC-seq_1; Homo sapiens; Bisulfite-Seq</TITLE>
<STUDY_REF accession="SRP064113">
<IDENTIFIERS>
<PRIMARY_ID>SRP064113</PRIMARY_ID>
<EXTERNAL_ID namespace="BioProject">PRJNA296788</EXTERNAL_ID>
</IDENTIFIERS>
</STUDY_REF>
<DESIGN>
<DESIGN_DESCRIPTION/>
<SAMPLE_DESCRIPTOR accession="SRS868521">
<IDENTIFIERS>
...
Thanks in advance.
Upvotes: 1
Views: 4151
Reputation: 66
We had a similar issue with our XML source and worked through it with AWS technical support. There appears to be a bug in the XML crawler: if an XML attribute has an empty value (in the example you have given, xmlns is ""), the crawler seems to skip the classifier you have defined and default to a row tag that most likely comes from a nested element in the XML.
They are working on a fix, which is likely to be released this week or next.
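In the meantime, a quick way to check whether a file contains the kind of empty attribute value described above is to scan it locally. The following is a minimal sketch using only the Python standard library; the function name `find_empty_attributes` is hypothetical. It only reports where empty values occur and does not change how Glue behaves.

```python
import xml.parsers.expat


def find_empty_attributes(xml_data):
    """Return (tag, attribute) pairs whose attribute value is the empty string."""
    hits = []

    def on_start(tag, attrs):
        # attrs is a dict of attribute name -> value for this element
        for attr_name, value in attrs.items():
            if value == "":
                hits.append((tag, attr_name))

    # No namespace processing, so xmlns="" is reported like any other attribute.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_data, True)
    return hits
```

Calling it on the contents of a file such as experiment.xml lists each element and attribute pair with an empty value, which shows whether a given dataset is likely to be affected.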
Hope this helps.
Upvotes: 1