Alejandro Rodriguez

Reputation: 252

SQL group by equivalent in map-reduce

Most of the queries I run have this format:

SELECT COUNT(*),A.a1 FROM A WHERE A.a2='x' GROUP BY A.a1

Table A is an HDFS folder with CSV files in it.

Since Hive is ridiculously slow, how can I implement this query in MapReduce?

Thanks

Upvotes: 1

Views: 5716

Answers (3)

Uri Goren

Reputation: 13700

You could use Hadoop Streaming instead of Hive, though I can't guarantee it would be much faster:

hadoop jar hadoop-streaming-*.jar -D stream.non.zero.exit.is.failure=false -mapper 'grep {p}' -reducer 'wc -l' -input {d} -output tmp
hadoop fs -getmerge tmp {d}
hadoop fs -rmr tmp

Replace {d} with your input directory and {p} with a regex pattern that matches your condition.

Upvotes: 2

suresiva

Reputation: 3173

Maybe the steps below will give you an idea of how to do this.

In the Mapper:

  1. Split the line on the comma delimiter.
  2. Check whether a2 = 'x'.
  3. If it matches, write a1 as the key and the integer 1 as the value.

In the Reducer:

  1. Iterate over all the values of the given key.
  2. Sum up all the 1s for that key.
  3. After the loop, reuse the same key object as the key and write the sum as the value.
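For illustration, here is a rough sketch of those steps with the Hadoop Java API. The column positions of a1 and a2 and the filter value 'x' are assumptions, so adjust them to your CSV layout:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: split the line, filter on a2 = 'x', emit (a1, 1).
    // Assumes a1 is the first CSV column and a2 the second (an assumption for this sketch).
    class FilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");             // 1. split on comma
            if (fields.length > 1 && "x".equals(fields[1].trim())) {  // 2. keep rows where a2 = 'x'
                outKey.set(fields[0].trim());                         // 3. a1 as key, 1 as value
                context.write(outKey, ONE);
            }
        }
    }

    // Reducer: sum the 1s for each a1 and reuse the key object for the output.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable sum = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {  // 1. iterate over all values of the key
                total += v.get();           // 2. sum the 1s
            }
            sum.set(total);
            context.write(key, sum);        // 3. same key object, sum as value
        }
    }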

Please go through the Summarization Patterns chapter of the MapReduce Design Patterns book by Donald Miner.

Upvotes: 1

Fabian Hueske

Reputation: 18987

Your SQL query can be mapped to the HelloWorld equivalent of MapReduce: WordCount.

I doubt that a custom implementation can do it much faster than Hive (which compiles to MapReduce), but here is how to do it:

  • TextInputFormat -> (pos, line)
  • (pos, line) -> Mapper: parse/tokenize the line, extract a1 and a2, filter on a2='x' -> (a1, count=1)
  • (a1, count) -> Reducer: Sum all count fields -> (a1, sum)
  • (a1, sum) -> some OutputFormat

Hadoop takes care of the GROUP BY automatically because the Mapper emits a1 as the key.

Don't forget to also register the Reducer as the Combiner to enable local pre-aggregation.
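As a rough sketch of that pipeline (class names, column positions, and paths are placeholders, not a definitive implementation), the whole job could be wired up like this:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class GroupByCount {

        // TextInputFormat -> (pos, line); filter on a2 = 'x' and emit (a1, 1).
        // Assumes a1 is the first CSV column and a2 the second.
        public static class A1Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text a1 = new Text();

            @Override
            protected void map(LongWritable pos, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(",");
                if (fields.length > 1 && "x".equals(fields[1].trim())) {
                    a1.set(fields[0].trim());
                    ctx.write(a1, ONE);
                }
            }
        }

        // (a1, [1, 1, ...]) -> (a1, sum); also usable as the Combiner.
        public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable sum = new IntWritable();

            @Override
            protected void reduce(Text a1, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) {
                    total += c.get();
                }
                sum.set(total);
                ctx.write(a1, sum);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "count-group-by-a1");
            job.setJarByClass(GroupByCount.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setMapperClass(A1Mapper.class);
            job.setCombinerClass(CountReducer.class);   // local pre-aggregation
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS folder with the CSV files
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }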

If you are looking for faster execution, you should consider another file format (Parquet, ORC) and another engine (Hive-on-Tez, Apache Flink, Apache Spark, or even a specialized engine for structured data such as Apache Drill, Impala, Apache Kylin, Presto, ...). The best choice depends, among other things, on the size of your data, the execution time requirements (sub-second, < 1 min, ...), and what other use cases you would like to handle.

Upvotes: 2
