Reputation: 345
Hi I have the following table:
ID   | time
=============
5    | 200101
3    | 200102
2    | 200103
12   | 200101
16   | 200103
18   | 200106
Now I want to know how often each year-month value appears. I can't use a plain GROUP BY because it only returns counts for the months that actually appear in the table. I also want to get a 0 for any month that does not appear. So the output should be something like this:
time   | count
==============
200101 | 2
200102 | 1
200103 | 2
200104 | 0
200105 | 0
200106 | 1
Sorry for the bad table format, I hope it is still clear what I mean. I would appreciate any help.
Upvotes: 0
Views: 6883
Reputation: 20816
You can create a year-month table that contains every year-month value in the range you care about. Here is a script that generates such a CSV file:
#!/bin/bash
# year_month.sh: write every YYYYMM value in the year range to year_month.csv
start_year=1970
end_year=2015
for year in $(seq "${start_year}" "${end_year}"); do
    for month in $(seq 1 12); do
        # zero-pad the month so January 1970 becomes 197001
        printf '%d%02d\n' "${year}" "${month}"
    done
done > year_month.csv
Save it as year_month.sh and run it. You will get a file year_month.csv containing every year-month value from 1970 through 2015. Change start_year and end_year to set the year range.
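If you would rather not edit the script each time, the same loop can be wrapped in a function that takes the year range as arguments. This is only a minimal sketch: the function name year_months is made up for illustration, and it uses plain shell arithmetic instead of seq.

```shell
# Hypothetical helper: print every YYYYMM value from year $1 through year $2,
# one value per line.
year_months() {
  ym_year=$1
  while [ "$ym_year" -le "$2" ]; do
    ym_month=1
    while [ "$ym_month" -le 12 ]; do
      # zero-pad the month to two digits
      printf '%d%02d\n' "$ym_year" "$ym_month"
      ym_month=$((ym_month + 1))
    done
    ym_year=$((ym_year + 1))
  done
}

# Show the first three values for 2001.
year_months 2001 2001 | head -n 3
```

Redirecting the output, for example `year_months 1970 2015 > year_month.csv`, should give the same file as the script above.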
Then upload the year_month.csv file to HDFS. For example:
hadoop fs -mkdir /user/joe/year_month
hadoop fs -put year_month.csv /user/joe/year_month/
After that, you can load year_month.csv into Hive as an external table. For example:
create external table if not exists
year_month (time int)
location '/user/joe/year_month';
Finally, join the new table with your table to get the result. For example, assuming your table is named id_time:
from (select year_month.time as time, id_time.id as id
      from year_month
      left outer join id_time
      on year_month.time = id_time.time) temp
select time, count(id) as count
group by time;
Note: you will need to make small modifications (such as the path and the column types) to the statements above.
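If you want to sanity-check the logic before running it on the cluster, the left outer join plus count can be imitated locally with awk on the sample rows from the question. This is only a rough sketch, not Hive itself: the function name count_per_month and the temporary file names are made up for illustration, and the id_time rows are assumed to be stored as "id,time".

```shell
# Hypothetical local check: count rows per month, printing 0 for months
# that never occur.
# $1: file with every month (one YYYYMM per line)
# $2: file with "id,time" rows
count_per_month() {
  awk -F',' '
    NR == FNR { seen[$1] = 0; order[++n] = $1; next }  # first file: every month starts at 0
    ($2 in seen) { seen[$2]++ }                        # second file: count matching rows
    END { for (i = 1; i <= n; i++) print order[i] "\t" seen[order[i]] }
  ' "$1" "$2"
}

tmp_dir=$(mktemp -d)
printf '%s\n' 200101 200102 200103 200104 200105 200106 > "$tmp_dir/year_month.csv"
printf '%s\n' 5,200101 3,200102 2,200103 12,200101 16,200103 18,200106 > "$tmp_dir/id_time.csv"
count_per_month "$tmp_dir/year_month.csv" "$tmp_dir/id_time.csv"
rm -rf "$tmp_dir"
```

Months 200104 and 200105 come out with a count of 0, which is what the left outer join followed by count(id) should produce in Hive, since count(id) skips the NULL ids of unmatched months. 200103 appears twice in the sample rows, so its count is 2.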
Upvotes: 3