Reputation: 287
I have a hive table which has a column called DATALIST. It can have below values
XYZ_EF1_GHD_OLDDAT
XYZ_EF1_GHD_NEWFORMAT
ABC_FG1_JJJ_ASD_OLDDAT
TTT_LMN_IJK_YHVS_TXY_OLDDAT
TTT_LMN_IJK_YHVS_TXY_NEWFORMAT
PQR_GHSS_TYHT_NEWFORMAT
I need to create an output which would identify all the names which do not have both the _NEWFORMAT and _OLDDAT in the column.In those scenarios it should output the following
Value Reason
ABC_FG1_JJJ_ASD Missing NEW
PQR_GHSS_TYHT Missing OLD
XYZ_EF1_GHD Contains Both NEW and OLD
TTT_LMN_IJK_YHVS_TXY Contains both NEW and OLD
Any suggestion/help with the SQL/HIVEQL logic would be really appreciated.
Upvotes: 1
Views: 51
Reputation: 222572
In a subquery, you could use regexp functions regexp_replace()
and regexp_extract()
to separate the value from the suffix. Then, you can aggregate and apply the logic in the outer query:
select
value,
case
when min(suffix) = '_NEWFORMAT' and max(suffix) = '_OLDDAT'
then 'Contains both NEW and OLD'
when max(suffix) = '_NEWFORMAT'
then 'Missing OLD'
else 'Missing NEW'
end reason
from (
select
regexp_replace(datalist, '_OLDDAT|_NEWFORMAT$', '') value,
regexp_extract(datalist, '(_OLDDAT|_NEWFORMAT)$', 1) suffix
from mytable
) x
group by value
Upvotes: 1