Reputation: 4652
I would like to get the percentage of NULL values in a table in Hive. Is there an easy way to do this without having to enumerate all column names in the query? In this case there are about 50k rows and 20 columns. Thanks in advance!
Something like:
SELECT count(each_column) / count(*) FROM TABLE_1
WHERE each_column = NULL;
Upvotes: 2
Views: 2655
Reputation: 21561
The approach you need depends on your situation:
I once wrote a Python script for this. I no longer have it at hand, but it is quite easy to recreate with the following logic:
Of course it can be extended to run for other tables and other statistics, but be aware that this may not scale well.
In my case, I think I had to split the query building into batches of 20 columns each, which were then concatenated afterwards, because running it on all 400 columns at once generated a query that was too complex.
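The original script is not available, so the following is only a rough sketch of the batching idea, not a reconstruction of it. It assumes the column names have already been collected into a Python list (for example from the output of DESCRIBE), and it uses an avg(case when ... is null ...) expression as the per-column statistic. The function name build_null_pct_queries, the table name and the column names are all hypothetical.

```python
def build_null_pct_queries(table, columns, batch_size=20):
    """Return one SELECT statement per batch of columns.

    Each statement computes, for every column in its batch, the fraction
    of rows in which that column is NULL.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    queries = []
    for start in range(0, len(columns), batch_size):
        batch = columns[start:start + batch_size]
        # Backticks quote identifiers in HiveQL.
        exprs = [
            f"avg(case when `{col}` is null then 1.0 else 0.0 end) as `{col}_null_p`"
            for col in batch
        ]
        queries.append("select " + ",\n       ".join(exprs) + f"\nfrom {table}")
    return queries


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    cols = [f"col{i}" for i in range(1, 46)]
    for query in build_null_pct_queries("table_1", cols, batch_size=20):
        print(query, end=";\n\n")
```

Each generated statement returns a single row, so the per-batch results can be collected and joined side by side by whatever client runs the queries. The batch size of 20 only mirrors the number mentioned above; a suitable value probably depends on the cluster and the Hive version.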
Upvotes: 1
Reputation: 1270993
If you do this using code, you need to list the columns. Here is one way:
select avg(case when col1 is null then 1.0 else 0.0 end) as col1_null_p,
avg(case when col2 is null then 1.0 else 0.0 end) as col2_null_p,
. . .
from t;
If you take the list of columns in the table, you can readily construct the query in a spreadsheet.
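As a quick sanity check of the avg(case ...) pattern, here is a minimal sketch that builds the same kind of query from a list of column names and runs it against an in-memory SQLite table. SQLite is only a stand-in for Hive here, chosen because it ships with Python. The helper null_fractions and the sample data are made up for illustration.

```python
import sqlite3


def null_fractions(conn, table, columns):
    """Return {column: fraction of rows where the column is NULL}."""
    exprs = ", ".join(
        f'avg(case when "{c}" is null then 1.0 else 0.0 end) as "{c}_null_p"'
        for c in columns
    )
    row = conn.execute(f'select {exprs} from "{table}"').fetchone()
    return dict(zip(columns, row))


# Tiny made-up table: col1 is NULL in 2 of 4 rows, col2 in 1 of 4 rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (col1 integer, col2 text)")
conn.executemany(
    "insert into t values (?, ?)",
    [(1, None), (None, "a"), (3, "b"), (None, "c")],
)

result = null_fractions(conn, "t", ["col1", "col2"])
print(result)  # {'col1': 0.5, 'col2': 0.25}
```

Because avg averages a 1.0/0.0 flag over all rows, the result is directly the NULL fraction per column; multiply by 100 if a percentage is wanted.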
Upvotes: 3