user2523848

Reputation: 345

complex Hive Query

Hi, I have the following table:

ID  | time
============
5   | 200101
3   | 200102
2   | 200103
12  | 200101
16  | 200103
18  | 200106

Now I want to know how often each month of each year appears. I can't use a plain GROUP BY, because it only counts the months that actually appear in the table. I also want to get a 0 when a certain month does not appear at all. So the output should be something like this:

time   | count
===============
200101 | 2
200102 | 1
200103 | 2
200104 | 0
200105 | 0
200106 | 1

Sorry for the bad table format, I hope it is still clear what I mean. I would appreciate any help.

Upvotes: 0

Views: 6883

Answers (1)

zsxwing

Reputation: 20816

You can provide a year-month table that contains every year and month combination, and join against it. I wrote a script for you to generate such a CSV file:

#!/bin/bash

# year_month.sh

start_year=1970
end_year=2015

for year in $(seq "${start_year}" "${end_year}"); do
    for month in $(seq 1 12); do
        # Zero-pad the month so every value has the form YYYYMM.
        printf '%d%02d\n' "${year}" "${month}"
    done
done > year_month.csv

Save it as year_month.sh and run it. You will get a file year_month.csv containing every year-month value from 197001 to 201512, one per line. Change start_year and end_year to adjust the year range.
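If you would rather pass the range on the command line than edit the variables, the same loop can be wrapped in a function. This is only a sketch: gen_year_month is a name I made up for illustration, and it uses plain POSIX shell arithmetic instead of seq.

```shell
#!/bin/sh
# gen_year_month is a hypothetical helper, not part of the original script.
# Usage: gen_year_month START_YEAR END_YEAR  (both inclusive)
gen_year_month() (
    year=$1
    while [ "$year" -le "$2" ]; do
        month=1
        while [ "$month" -le 12 ]; do
            # Zero-pad the month so every line has the form YYYYMM.
            printf '%d%02d\n' "$year" "$month"
            month=$((month + 1))
        done
        year=$((year + 1))
    done
)

# The sample data only covers 2001, so a single year is enough here.
gen_year_month 2001 2001 | head -n 6
```

Redirect the output to get the file, for example gen_year_month 1970 2015 > year_month.csv, which should give 552 lines (46 years times 12 months).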

Then, upload the year_month.csv file to HDFS. For example,

hadoop fs -mkdir /user/joe/year_month
hadoop fs -put year_month.csv /user/joe/year_month/

After that, you can load year_month.csv into Hive. For example,

create external table if not exists 
year_month (time int) 
location '/user/joe/year_month';

Finally, you can join the new table with your table to get the result. For example, assuming your table is called id_time:

-- count(id) skips NULLs, so months without a matching row get 0.
from (select year_month.time as time, id_time.id as id
      from year_month
      left outer join id_time
      on year_month.time = id_time.time) temp
select time, count(id) as count
group by time;

Note: you will need to make small modifications (such as the path and the column types) to the statements above to fit your setup.
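If you want to sanity-check the join-and-count logic before touching the cluster, you can mimic it locally with awk. This is a rough sketch, not Hive: count_by_month and the two file names are made up for illustration, and the input is assumed to be whitespace-separated "id time" rows.

```shell
#!/bin/sh
# count_by_month is a hypothetical helper that mimics the Hive query locally.
# $1: file with one YYYYMM value per line (plays the role of year_month)
# $2: file with "id time" rows separated by whitespace (plays the role of id_time)
count_by_month() {
    awk '
        # Data file: tally how many ids fall in each month.
        FILENAME == ARGV[1] { seen[$2]++; next }
        # Calendar file: print every month; adding 0 turns a missing tally into 0.
        { print $1, seen[$1] + 0 }
    ' "$2" "$1"
}

# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 200101 200102 200103 200104 200105 200106 > "$tmp/months.txt"
printf '%s\n' '5 200101' '3 200102' '2 200103' '12 200101' '16 200103' '18 200106' > "$tmp/id_time.txt"
count_by_month "$tmp/months.txt" "$tmp/id_time.txt"
rm -r "$tmp"
```

Every month from the calendar file is printed, and 200104 and 200105 come out with 0 because no row in the data file matches them. That is the same effect the left outer join plus count(id) has in Hive.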

Upvotes: 3
