Reputation: 679
I have email data in XML format and am trying to load it into multiple SAS tables. The XML is not flat; it has a multi-level hierarchy. From this file I want to create several SAS tables (for example, Sender, Recipients, Attachments, Email Body, and Metadata). Each email message has exactly one sender and one body, but can have any number of recipients and attachments. To do this I am currently using an XML map file to translate the data into the tables I need.
The problem is that with the XMLV2 engine and a map file, SAS appears to read the XML file once for every table I want to create, and this doesn't scale well: with 200GB of XML files and 10 target tables, I would end up reading 2TB of data. Is there a better way to process XML files so that a single pass of each file reads all of the data out into SAS datasets?
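For reference, this is roughly my current setup (file names and table names are illustrative). Each data step that reads from the XMLV2 libref appears to trigger its own pass over the XML file:

```sas
/* Current approach: one XMLV2 libref driven by an XML map file */
filename emailxml 'C:\data\emails.xml';
filename emailmap 'C:\data\email.map';
libname  emailxml xmlv2 xmlmap=emailmap access=readonly;

/* Each of these steps seems to re-read the whole XML file */
data work.sender;     set emailxml.sender;     run;
data work.recipients; set emailxml.recipients; run;
```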
Thanks in advance.
Upvotes: 3
Views: 575
Reputation: 46
Allocate the directory as an aggregate file location and read the files in turn within a single data step; search for FILEVAR= in the SAS help for examples of how to do this. The XML map file gives you the XPath detail you need to locate the content within each XML file, which you can then read directly with data step code.
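A minimal sketch of the FILEVAR= pattern, assuming Windows paths and illustrative element names (the real XPaths come from your map file); the tag parsing here is deliberately crude:

```sas
/* Pipe a directory listing in, then read each member via FILEVAR= */
filename dirlist pipe 'dir /b C:\data\emails\*.xml';

data work.recipients;
   length memname xmlfile $ 260 line $ 32767 recipient $ 200;
   infile dirlist truncover;
   input memname $260.;
   xmlfile = cats('C:\data\emails\', memname);

   /* FILEVAR= switches the input file on every iteration: one pass total */
   infile xml filevar=xmlfile end=done truncover;
   do while (not done);
      input line $char32767.;
      if find(line, '<Recipient>') then do;
         recipient = scan(line, 2, '<>');  /* crude tag-content extraction */
         output;
      end;
   end;
   keep memname recipient;
run;
```

In a real job you would add similar parsing branches (and OUTPUT statements to additional datasets named on the DATA statement) for sender, attachments, and so on, so every target table is populated in the same single pass.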
The alternative suggested in an earlier post will also work: pre-process the N XML files in a data step as above, writing the shared header content (opening tags) once, then the data content from each file (discarding the header of each subsequent file), and finally the closing tags once. This process is very quick, and your original XML map file can then process the combined file in a single pass. You can test the approach quickly by manually editing two XML files to collapse them into one; that will tell you which content is common and needed only once.
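The stitching step above could be sketched like this, assuming a shared `<EMAILS>` root element and Windows paths (both are placeholders for whatever your files actually use):

```sas
filename dirlist pipe 'dir /b C:\data\emails\*.xml';
filename bigxml 'C:\data\all_emails.xml';

data _null_;
   length memname xmlfile $ 260 line $ 32767;
   infile dirlist truncover end=lastfile;
   input memname $260.;
   xmlfile = cats('C:\data\emails\', memname);

   file bigxml;
   if _n_ = 1 then put '<?xml version="1.0"?>' / '<EMAILS>';

   infile xml filevar=xmlfile end=done truncover;
   do while (not done);
      input line $char32767.;
      /* Drop each file's own declaration and root tags; keep the data */
      if index(line, '<?xml') or index(line, '<EMAILS')
         or index(line, '</EMAILS') then continue;
      put line;
   end;

   if lastfile then put '</EMAILS>';
run;
```

Afterwards, point your existing XMLV2 libref at `all_emails.xml` with the same map file, and each target table is extracted from one large file instead of N small ones.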
Upvotes: 0