Reputation: 252
Most of the queries I run have this format:
SELECT COUNT(*), A.a1 FROM A WHERE A.a2='x' GROUP BY A.a1
The A table is an HDFS folder with CSV files in it. Since Hive is ridiculously slow, how can I implement this query in MapReduce?
Thanks
Upvotes: 1
Views: 5716
Reputation: 13700
You could use Hadoop Streaming instead of Hive, though I can't guarantee it will be much faster:
hadoop jar hadoop-streaming-*.jar -D stream.non.zero.exit.is.failure=false -mapper 'grep {p}' -reducer 'wc -l' -input {d} -output tmp
hadoop fs -getmerge tmp {d}
hadoop fs -rmr tmp
Replace {d} with your input dir, and {p} with a regex pattern that catches your condition.
Upvotes: 2
Reputation: 3173
Maybe the steps below will give you an idea about this.
From the Mapper, emit the group key a1 with a count of 1 for every line that passes the a2='x' filter.
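A minimal sketch of such a Mapper in the Hadoop Java API, assuming comma-separated lines with a1 in the first column and a2 in the second (the class name and column positions are assumptions):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (a1, 1) for every CSV line where a2 equals "x".
public class GroupCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable pos, Text line, Context context)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split(",", -1); // naive CSV split (no quoted fields)
        if (cols.length >= 2 && "x".equals(cols[1].trim())) { // WHERE a2='x'
            outKey.set(cols[0].trim());                       // GROUP BY a1
            context.write(outKey, ONE);
        }
    }
}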
From the Reducer, sum the counts for each a1 key to get COUNT(*) per group.
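And a matching Reducer sketch that sums the per-key counts (again, the class name is illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s emitted by the Mapper, yielding (a1, COUNT(*)).
public class GroupCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text a1, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        total.set(sum);
        context.write(a1, total);
    }
}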
Please go through the Summarization Patterns chapter of the MapReduce Design Patterns book by Donald Miner.
Upvotes: 1
Reputation: 18987
Your SQL query can be mapped to the HelloWorld equivalent of MapReduce: WordCount.
I doubt that a custom implementation can do it much faster than Hive (which compiles to MapReduce), but here is how to do it:
(pos, line) -> Mapper (parse / tokenize line, extract a1 and a2, filter on a2='x') -> (a1, count=1)
(a1, count) -> Reducer (sum all count fields) -> (a1, sum)
(a1, sum) -> some OutputFormat
Hadoop will automatically take care of the GroupBy because the Mapper sets a1 as the key field.
Don't forget to also register the Reducer as the Combiner to enable local pre-aggregation.
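A minimal driver sketch showing that wiring, assuming Mapper and Reducer classes like the ones sketched above (GroupCountMapper / GroupCountReducer are hypothetical names; summing is associative, so the Reducer can double as Combiner):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group-count");
        job.setJarByClass(GroupCountJob.class);
        job.setMapperClass(GroupCountMapper.class);
        job.setCombinerClass(GroupCountReducer.class); // local pre-aggregation
        job.setReducerClass(GroupCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS dir of CSV files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}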
If you are looking for faster execution, you should consider another file format (Parquet, ORC) and another engine (Hive-on-Tez, Apache Flink, Apache Spark, or even a specialized engine for structured data like Apache Drill, Impala, Apache Kylin, Presto, ...). The best choice depends, among other things, on the size of your data, the execution time requirements (sub-second, < 1 min, ...), and what other kinds of use cases you would like to handle.
Upvotes: 2