Reputation: 11
In my project I found two Hive tables. In their CREATE TABLE statements, one table has ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0004' and the other has ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u001C'. What do '\u0004' and '\u001C' mean, and when should each be used?
Upvotes: 0
Views: 513
Reputation: 9944
In many text formats, \u introduces a Unicode escape sequence. This is a way of storing or sending a character that can't be easily displayed or represented in the format you're using. The four characters after the \u are the Unicode "code point" in hexadecimal. A Unicode code point is a number denoting a specific Unicode character.
All characters have a code point, even the printable ones. For example, a is U+0061.
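As a small illustration of the point above, here is a sketch in Python (chosen only for demonstration; Hive is not involved) showing that the escape \u0061 and the literal a are the same character, and how the U+ notation relates to the numeric code point. The variable names are just for this example.

```python
# The escape sequence \u0061 and the literal "a" denote the same character.
letter = "\u0061"

# ord() returns the code point as an integer; format it in U+XXXX style.
code_point = ord("a")
code_point_hex = f"U+{code_point:04X}"

print(letter, code_point, code_point_hex)
```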
U+0004 and U+001C are both unprintable characters, meaning there's no standard character you can use to display them on the screen. That's why an escape sequence is used here.
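To check this for yourself, the following Python sketch inspects both delimiters with the standard library. It assumes nothing about Hive; it only shows that both characters are reported as non-printable and fall in the Unicode "Cc" (control) category, which is why they have to be written as escapes.

```python
import unicodedata

# The two delimiters from the question, written as Unicode escapes.
delimiters = {"U+0004": "\u0004", "U+001C": "\u001C"}

report = {}
for label, ch in delimiters.items():
    # (is it printable?, Unicode general category, escaped representation)
    report[label] = (ch.isprintable(), unicodedata.category(ch), repr(ch))

for label, info in report.items():
    print(label, info)
```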
If you use a simple, printable character like , as your field delimiter, it will make the stored data easier for a human to read. The field values will be stored with a , between each one. For example, you might see the values one, two and three stored as:
one,two,three
But if you expect your field values to actually contain a , character, it would be a poor choice of field delimiter (because then you'd need a special way to tell the difference between a single field with a value of one,two and two different fields with the values one and two). Unprintable characters such as U+0004 and U+001C are popular delimiters for exactly this reason: they very rarely appear in real field values. The choice of delimiter depends both on whether you want to be able to read the stored data easily, and on what characters you expect the fields to contain.
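The ambiguity described above can be sketched in a few lines of Python. This is only an illustration of plain delimiter-based joining and splitting, not a model of how Hive itself reads rows; the helper names join_fields and split_fields are made up for this example.

```python
def join_fields(fields, delimiter):
    """Store a record as one line, fields separated by the delimiter."""
    return delimiter.join(fields)


def split_fields(line, delimiter):
    """Recover the fields from a stored line."""
    return line.split(delimiter)


# A record with two fields, the first of which contains a comma.
record = ["one,two", "three"]

# With "," as the delimiter, the round trip yields three fields instead of two.
comma_line = join_fields(record, ",")
comma_result = split_fields(comma_line, ",")

# With U+0004 as the delimiter, the original two fields come back intact.
ctrl_line = join_fields(record, "\u0004")
ctrl_result = split_fields(ctrl_line, "\u0004")

print(comma_result)
print(ctrl_result)
```

The trade-off is that a line joined with U+0004 is harder to read in a text editor, since the delimiter has no visible glyph.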
Upvotes: 0