HiveQL to Group by a substring of a column value and identify missing groups

Question

I have a hive table which has a column called DATALIST. It can have below values

XYZ_EF1_GHD_OLDDAT
XYZ_EF1_GHD_NEWFORMAT
ABC_FG1_JJJ_ASD_OLDDAT
TTT_LMN_IJK_YHVS_TXY_OLDDAT
TTT_LMN_IJK_YHVS_TXY_NEWFORMAT
PQR_GHSS_TYHT_NEWFORMAT

I need to create an output which would identify all the names which do not have both the _NEWFORMAT and _OLDDAT in the column.In those scenarios it should output the following

Value                          Reason
ABC_FG1_JJJ_ASD               Missing NEW
PQR_GHSS_TYHT                 Missing OLD
XYZ_EF1_GHD                   Contains Both NEW and OLD
TTT_LMN_IJK_YHVS_TXY          Contains both NEW and OLD

Any suggestion/help with the SQL/HIVEQL logic would be really appreciated.

GMB · Accepted Answer

In a subquery, you could use regexp functions regexp_replace() and regexp_extract() to separate the value from the suffix. Then, you can aggregate and apply the logic in the outer query:

select 
    value,
    case 
        when min(suffix) = '_NEWFORMAT' and max(suffix) = '_OLDDAT' 
            then 'Contains both NEW and OLD'
        when max(suffix) = '_NEWFORMAT'
            then 'Missing OLD'
        else 'Missing NEW'
    end reason
from (
    select 
        regexp_replace(datalist, '_OLDDAT|_NEWFORMAT$', '') value,
        regexp_extract(datalist, '(_OLDDAT|_NEWFORMAT)$', 1) suffix
    from mytable
) x
group by value

HiveQL to Group by a substring of a column value and identify missing groups

Answers (1)

Related Questions