Jamie Pollard

Reputation: 1599

Importing a UTF-8 XML file into SAS, how to set encoding?

Firstly I am not a SAS programmer, so forgive me if this question is too easy or is difficult to follow!

I have an application which creates UTF-8 encoded XML files (and map files) that are to be read into SAS (9.3). These files can contain characters such as the following (note the less-than-or-equal sign):

<DocumentElement>
  <DATA>
    <TEXT>≤ 50 %</TEXT>
  </DATA>
</DocumentElement>

We have an external third party attempting to read these files, but I understand that SAS's default encoding is Wlatin1.

Based on the SAS docs, I have suggested a number of options for them to specify when reading these files, but I can't seem to find the correct combination of encoding options. Basically I want to import the XML, with a given MAP, into a SAS dataset while preserving the UTF-8 characters.

Assuming we are using libname xml, the docs suggest the following to read the xml:

filename NHL 'C:\My Documents\XML\NHL.xml';
filename MAP 'C:\My Documents\XML\NHL.map';
libname NHL xml xmlmap=MAP;

proc print data=NHL.TEAMS; 
run;

Which statements do I have to apply encoding options to? (I have tried the libname statement with XMLENCODING, INENCODING and OUTENCODING.)

Upvotes: 1

Views: 2921

Answers (2)

Dominic Comtois

Reputation: 10411

Whichever encoding your SAS session uses, you can use the filename statement's encoding= option to tell SAS which encoding the external file uses. It will not affect the encoding used to write the data to a SAS table, but it will make sure the input files are read correctly.

filename NHL 'C:\My Documents\XML\NHL.xml' encoding="utf-8";
filename MAP 'C:\My Documents\XML\NHL.map' encoding="utf-8";

Note, however, that SAS expects a UTF-8 byte order mark (BOM) to be present in the file.
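
Combined with the libname statement from the question, this might look like the sketch below. It only reuses the paths and names from the question. Whether a character such as ≤ survives still depends on the session encoding being able to represent it, which is an assumption here and is not something the filename option changes.

```sas
/* Tell SAS that the external XML and map files are UTF-8 encoded */
filename NHL 'C:\My Documents\XML\NHL.xml' encoding="utf-8";
filename MAP 'C:\My Documents\XML\NHL.map' encoding="utf-8";

/* Same libname statement as in the question; no encoding option on it */
libname NHL xml xmlmap=MAP;

proc print data=NHL.TEAMS;
run;
```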

Upvotes: 1

Jamie Pollard

Reputation: 1599

OK, I think I figured this out.

It turns out SAS has a session encoding, and it will try to transcode the data to that encoding if the input files use a different one. Running SAS with a session encoding of UTF-8 avoids all of these issues, and you can then specify the ENCODING= option for any files that need it (I don't have to, as mine are already UTF-8).

SAS have a paper about this here.
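
As a rough sketch of what this looks like in practice: the session encoding is controlled by the ENCODING system option, which is set when SAS starts (on the command line or in the configuration file) rather than from within a running session. The command line shown in the comment is only an example; the launcher and configuration file to use depend on the installation. The check that follows can be run in the session to see which encoding is actually in effect.

```sas
/* Start SAS with a UTF-8 session encoding, for example:
     sas -encoding utf-8
   ENCODING is a startup option; the exact launcher or configuration
   file depends on the installation (an assumption here). */

/* Inside the session, confirm the encoding in effect */
proc options option=encoding;
run;

/* The automatic macro variable SYSENCODING holds the same information */
%put Session encoding: &sysencoding;
```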

Upvotes: 1
