arsenal

Reputation: 24144

Java MapReduce job to calculate the percentage

Below is my Table (MyTable)

ID          TotalCount   ErrorCount   DT
----------------------------------------------
1345653         5           3       20120709
534140349       5           2       20120709
601806615       5           1       20120709
682527813       4           3       20120709
687612723       3           2       20120709
704318001       5           4       20120709
1345653         5           2       20120710
704318001       1           0       20120710
1120784094      3           2       20120711

So if I need to calculate the error percentage for a specific date in Hive using HiveQL, I would do it like this:

SELECT 100 * sum(ErrorCount*1.0) / sum(TotalCount) FROM MyTable 
where dt = '20120709'; 

But I need to do the same thing using Java MapReduce. Is there a way to do this in Java MapReduce code? One thing that confuses me: whenever we write a MapReduce job in Java, do we read the file(s) for that date partition, or do we read the table?

Update: Below is the table definition that will contain the above scenario

create table lipy
( buyer_id bigint,
  total_chkout bigint,
  total_errpds bigint
 )
 partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/apps/hdmi-technology/lipy'
;

Upvotes: 0

Views: 1951

Answers (2)

Thomas Jungblut

Reputation: 20969

That is quite easy - let me take a shot at some pseudo code.

SELECT 100 * sum(ErrorCount*1.0) / sum(TotalCount) FROM MyTable 
where dt = '20120709'; 

Map Stage:

  • check whether the dt column equals 20120709
  • if it does, add the row's TotalCount to a running total sum and its ErrorCount to a running error sum (simple fields on the mapper work fine)
  • in cleanup emit two Key/Value pairs: -1 / total sum and 0 / error sum

Reduce stage: (the total sum arrives under key -1 and the error sum under key 0)

  • sum the values for key -1 and for key 0 separately
  • in cleanup you can calculate your percentage, and maybe send a mail if that is possible

Several things to note:

  • Map output is <IntWritable, IntWritable>, or <IntWritable, LongWritable> if the counts do not fit in an integer.
  • Set the number of reducers to 1, so a single reducer gets all the keys.

I believe that is everything to note - it is quite early here and I have had no coffee, so if you find a problem, feel free to tell me ;)
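The two stages above can be sketched as plain Java with no Hadoop dependencies, so the flow is easy to follow and run stand-alone; in a real job the two methods would become a `Mapper` and a `Reducer`, with the sums emitted from `cleanup()`. The class and method names here are purely illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of the map/reduce flow described above.
// Rows are tab-delimited strings: ID, TotalCount, ErrorCount, dt,
// matching the table layout in the question.
public class ErrorPercentageSketch {

    static final int TOTAL_KEY = -1; // key carrying the summed TotalCount
    static final int ERROR_KEY = 0;  // key carrying the summed ErrorCount

    // "Map stage": filter on the date, accumulate both sums;
    // a real Mapper would emit these two pairs from cleanup().
    static Map<Integer, Long> map(List<String> rows, String dt) {
        long total = 0, errors = 0;
        for (String row : rows) {
            String[] f = row.split("\t");
            if (f[3].equals(dt)) {               // dt column filter
                total += Long.parseLong(f[1]);   // TotalCount
                errors += Long.parseLong(f[2]);  // ErrorCount
            }
        }
        Map<Integer, Long> out = new HashMap<>();
        out.put(TOTAL_KEY, total);
        out.put(ERROR_KEY, errors);
        return out;
    }

    // "Reduce stage": with a single reducer both keys arrive in one place,
    // so the percentage can be computed in cleanup().
    static double reduce(Map<Integer, Long> sums) {
        return 100.0 * sums.get(ERROR_KEY) / sums.get(TOTAL_KEY);
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        rows.add("1345653\t5\t3\t20120709");
        rows.add("534140349\t5\t2\t20120709");
        rows.add("1345653\t5\t2\t20120710");
        System.out.println(reduce(map(rows, "20120709"))); // prints 50.0
    }
}
```

With several mappers, each emits its own partial sums under keys -1 and 0; the single reducer then adds the partial sums per key before dividing, which is why setting the reducer count to 1 matters.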

Upvotes: 1

Edmon

Reputation: 4872

You can do this, but the implementation will depend on:

  1. Whether your tables are external. About locations:
     https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintofilesystemfromqueries
  2. How the data is formatted - row format, delimited, ...
     http://hive.apache.org/docs/r0.9.0/language_manual/data-manipulation-statements.html
  3. How you want to execute MapReduce. One very straightforward option is to run your Java MapReduce code as custom map/reduce scripts invoked from HiveQL:
     https://cwiki.apache.org/Hive/tutorial.html#Tutorial-Custommap%252Freducescripts
     or simply run your custom MapReduce over the Hive table data in HDFS.
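For the last option, note that a Hive partition is just an HDFS directory under the table location (here dt=20120709 under /apps/hdmi-technology/lipy), so a plain MapReduce job reads the partition's files directly. A minimal job-setup sketch, assuming hypothetical MyMapper/MyReducer classes implementing the map and reduce stages, and a hypothetical output path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ErrorPercentageJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "error-percentage");
        job.setJarByClass(ErrorPercentageJob.class);
        job.setMapperClass(MyMapper.class);     // hypothetical mapper
        job.setReducerClass(MyReducer.class);   // hypothetical reducer
        job.setNumReduceTasks(1);               // single reducer sees both keys
        // the table is STORED AS SEQUENCEFILE, so use the matching input format
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(LongWritable.class);
        // the dt=20120709 partition is just a directory under the table location
        FileInputFormat.addInputPath(job,
                new Path("/apps/hdmi-technology/lipy/dt=20120709"));
        FileOutputFormat.setOutputPath(job,
                new Path("/tmp/lipy-error-pct")); // hypothetical output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This also answers the partition question from the post: MapReduce never reads "the table" - it reads the files in whichever partition directories you add as input paths.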

Upvotes: 0
