Hafiz Muhammad Shafiq

Reputation: 8680

How to get text bytes used by a string in Hive?

I have some data in a Hive 1.2.1 table, and I need the raw byte count of a specific column. The column holds raw HTML in multiple languages. To get the length in characters, I can use a simple query like the one below:

select baseurl, LENGTH(content) from clss limit 30;

The above query is fine for character length, but for text in languages other than English the value does not match the number of bytes. An Arabic character, for example, is stored as Unicode, so its character count differs from its byte count: some characters take two bytes and some take a single byte.

Is there any built-in function that returns the number of bytes in the text instead of the number of characters?
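To illustrate the gap between character count and byte count described in the question, here is a minimal Java sketch outside Hive. It assumes the text is encoded as UTF-8; the class and method names are hypothetical, and the sample strings are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Hive.
final class LengthDemo {
    // Number of Unicode code points, i.e. the "character length" of the string.
    static int charCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies when encoded as UTF-8 (assumed encoding).
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String english = "data";
        // Six Arabic letters, written as escapes so the source file encoding does not matter.
        String arabic = "\u0628\u064A\u0627\u0646\u0627\u062A";

        // prints "4 chars, 4 bytes"
        System.out.println(charCount(english) + " chars, " + utf8ByteCount(english) + " bytes");
        // prints "6 chars, 12 bytes"
        System.out.println(charCount(arabic) + " chars, " + utf8ByteCount(arabic) + " bytes");
    }
}
```

ASCII characters take one byte each in UTF-8, while each of these Arabic letters takes two, which is why a character count alone understates the stored size.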

Upvotes: 1

Views: 2758

Answers (1)

leftjoin

Reputation: 38335

The function character_length(string str) was added in Jira HIVE-15979, which lists 2.3.0 as the fix version. If you cannot upgrade Hive (upgrading can be quite risky), try downloading the UDF source code and building it yourself, then add the jar and create a temporary function.

Download code: GenericUDFCharacterLength.java
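If building the Hive source is not practical, the byte-counting logic itself is small. Below is a rough Java sketch of it, not the code from HIVE-15979. The class name is hypothetical, UTF-8 is an assumption about how the column is encoded, and the class deliberately has no Hive dependency. To register it in Hive it would still need to extend org.apache.hadoop.hive.ql.exec.UDF (or be rewritten as a GenericUDF) and be compiled against hive-exec.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class name; a sketch of the logic only, with no Hive dependency.
// For real use in Hive it would have to extend org.apache.hadoop.hive.ql.exec.UDF
// (or be written as a GenericUDF) and be packaged into a jar.
final class Utf8ByteLength {
    // Returns the UTF-8 byte count of the input, or null when the column value is NULL.
    public Integer evaluate(String text) {
        if (text == null) {
            return null;
        }
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        Utf8ByteLength fn = new Utf8ByteLength();
        System.out.println(fn.evaluate("<p>abc</p>")); // ASCII-only HTML fragment
        System.out.println(fn.evaluate("\u0628"));     // a single Arabic letter
        System.out.println(fn.evaluate(null));         // NULL in, NULL out
    }
}
```

Once packaged, registration would follow the steps in the answer: an ADD JAR statement pointing at the jar, then CREATE TEMPORARY FUNCTION utf8_bytes AS 'Utf8ByteLength'; (the function name and jar location are placeholders). As far as I know, HIVE-15979 also covers an octet_length function, which would be the byte-counting counterpart of character_length on newer Hive versions, but check this against the Jira before relying on it.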

Upvotes: 1
