JohnDoe

Reputation: 2514

How to encode RDF N-Triples string literals?

The specification for RDF N-Triples states that string literals must be encoded.

https://www.w3.org/TR/n-triples/#grammar-production-STRING_LITERAL_QUOTE

Does this "encoding" have a name I can look up to use it in my programming language? If not, what does it mean in practice?

Upvotes: 2

Views: 1632

Answers (3)

coderfi

Reputation: 378

In Python, you could use the n3() method of rdflib's Literal class, e.g.:

# pip install rdflib

>>> from rdflib import Literal
>>> lit = Literal('This "Literal" needs escaping!')
>>> s = lit.n3()
>>> print(s)
"This \"Literal\" needs escaping!"

Upvotes: 3

jschnasse

Reputation: 9498

In addition to Joshua Taylor's answer: it is almost always a good idea to normalize Unicode data to NFC. In Java, for example, you can use the following call:

java.text.Normalizer.normalize("rdf literal", java.text.Normalizer.Form.NFC);

For more information see: http://www.macchiato.com/unicode/nfc-faq

What is NFC?

For various reasons, Unicode sometimes has multiple representations of the same character. For example, each of the following sequences (the first two being single-character sequences) represents the same character:

U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ( Å ) ANGSTROM SIGN
U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is called NFC, for Normalization Form C, where the C is for composition. For more information on these, see the introduction of UAX #15: Unicode Normalization Forms. A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
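
For readers working in Python rather than Java, the same normalization can be sketched with the standard-library unicodedata module. The helper names to_nfc and is_nfc are only illustrative, chosen to mirror the toNFC(S) and isNFC(S) abbreviations; they are not part of any library.

```python
import unicodedata


def to_nfc(s: str) -> str:
    # Illustrative helper: normalize s to Normalization Form C.
    return unicodedata.normalize("NFC", s)


def is_nfc(s: str) -> bool:
    # Illustrative helper: True if s is already in NFC.
    return s == to_nfc(s)


# The three canonically equivalent sequences for the letter A with ring above.
forms = ["\u00C5", "\u212B", "A\u030A"]

# Normalizing each one should collapse them to a single representation.
normalized = {to_nfc(f) for f in forms}
```

After running this, normalized should contain only U+00C5, the precomposed form.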

Upvotes: 2

Joshua Taylor

Reputation: 85813

The grammar productions that you need are right in the document that you linked to:

[9]     STRING_LITERAL_QUOTE ::= '"' ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)* '"'
[141s]  BLANK_NODE_LABEL     ::= '_:' (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')* PN_CHARS)?
[10]    UCHAR                ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
[153s]  ECHAR                ::= '\' [tbnrf"'\]

This means that a string literal begins and ends with a double quote ("). Inside of the double quotes, you can have:

  • any character except #x22, #x5C, #xA, and #xD, that is, the double quote ("), the backslash (\), line feed, and carriage return, which must instead be written using the escapes below;
  • a Unicode character represented with a \u followed by four hex digits, or a \U followed by eight hex digits; or
  • an escape sequence, which is a \ followed by any of t, b, n, r, f, ", ', and \, representing tab, backspace, line feed, carriage return, form feed, double quote, single quote, and backslash respectively.
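
As a rough illustration of these rules, here is a minimal Python sketch of an escaper. The function name escape_ntriples_literal and the ascii_only flag are my own inventions for this example, not part of any library. It escapes only the four characters the grammar forbids in raw form, and optionally writes everything outside printable ASCII as a UCHAR escape.

```python
def escape_ntriples_literal(value: str, ascii_only: bool = False) -> str:
    """Return value wrapped in double quotes as an N-Triples string literal.

    Hypothetical helper for illustration only.
    """
    # The four characters STRING_LITERAL_QUOTE does not allow unescaped,
    # mapped to their ECHAR escapes.
    named = {
        "\\": "\\\\",
        '"': '\\"',
        "\n": "\\n",
        "\r": "\\r",
    }
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in named:
            out.append(named[ch])
        elif ascii_only and (cp < 0x20 or cp > 0x7E):
            # UCHAR: \uXXXX up to U+FFFF, \UXXXXXXXX above that.
            if cp <= 0xFFFF:
                out.append(f"\\u{cp:04X}")
            else:
                out.append(f"\\U{cp:08X}")
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'


example = escape_ntriples_literal('This "Literal" needs escaping!')
```

For the string used in the rdflib answer, example holds the same text as the output shown there. The ascii_only mode is an optional design choice for consumers that cannot handle UTF-8; the grammar itself permits non-ASCII characters unescaped.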

Upvotes: 4
