Ravi
Ravi

Reputation: 3303

Using regexp_extract in Hive

I am trying to find the rows from a hive table where a particular column does not contain null values or \N values or STX character '\002'. The objective is to find which rows contain some characters other than these three.

I tried this hive query:

    select column1,length(regexp_replace(column1,'\N|\002|NULL','')) as value         
    FROM table1 LIMIT 10;

I was expecting zero in the following cases but I am getting the following:

    column1 value
    NULL    NULL
            0
    NULL    NULL
            0
    \N\N\N\N\N\N\N\N        8
    NULL    NULL
    \N\N\N\N\N\N\N\N        8
    NULL    NULL
    NULL    NULL
    \N\N\N  3

Could someone please help me on the correct regex for the above case?

Thank you.

Ravi

Upvotes: 3

Views: 2388

Answers (1)

Ibrahim Najjar
Ibrahim Najjar

Reputation: 19423

It looks that hive is using Java's regular expression engine so the problem seems to be with the regex itself, more specifically in the escape sequences.

Try the following and if it doesn't work then please let me know:

(?:(?:\\\\N)+|\002|NULL)

Upvotes: 1

Related Questions