vikrant rana
vikrant rana

Reputation: 4678

Perform validation and checks on Hive table (may not be a duplicate)

We know that Hive does not validate data based on the fields and its a user responsibility to check it manually. I am aware of few basic checks which we can perform to validate the data.

  1. Count the number of records.
  2. Number of nulls on each column
  3. Number of unique/distinct values on each column
  4. Column level stats like min, max etc based on the column/data type
  5. use Hive's inbuilt function to_date and others to check validation on date columns

I am sure that there must be few more checks or validation which we can perform to validate the data on Hive tables. Any suggestions most welcome.

Upvotes: 5

Views: 1847

Answers (1)

leftjoin
leftjoin

Reputation: 38335

Unfortunately you cannot generate this query for each column in Hive. Do it manually like this or generate based describe table output using shell or some other tools:

select count(*)                                 as total_records,
       --repeat these for each column
       count(case when col1 is null then 1 end) as col1_nulls_cnt,
       count(distinct col1)                     as col1_distinct,
       min(col1)                                as col1_min,
       max(col1)                                as col1_max
from your_table;

Dates can be validated using cast(col1 as date) :

select cast(col1 as date) --returns NULL if the date is in wrong format

You can calculate NULLs produced by cast like in the first query:

count(case when cast(col1 as date) is null then 1 end) as col1_wrong_dates_cnt

Also for more complex checking you can join with required date range which can be generated or generated like this and check if the date is joined or not, like this:

select col1,
       case when d.dt is not null then 'Ok' else 'Wrong date' end date_check 
  from your_table t
  left join date_range d on t.col1=d.d.dt

Numeric/other primitive types columns can be checked also using the same cast() like in this answer: https://stackoverflow.com/a/38143497/2700344.

One important thing to bear in mind about Hive: When you insert wrong format string in the date/timestamp column, Hive will silently without exception convert it to NULL. This will happen with most primitive types. But if you will try to insert bigint into int column, Hive will silently truncate it, producing some different number which fits in the int size. Taking into account all these, better to build table with all STRINGs on top of raw data before validation.

Upvotes: 2

Related Questions