Reputation: 564
I have 6 fields in a CSV file: a student name (String) and their marks. I am writing MapReduce in Java, splitting all fields on the comma and sending the student name as the key and the marks as the value of the map.
In the reduce I process them, outputting the student name as the key and their marks plus the total, average, etc. as the value.
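Sketched in plain Java (without the Hadoop classes, and with made-up sample rows), the group-and-aggregate logic I mean is:

```java
import java.util.*;

public class StudentStats {
    // "Map" + "reduce" in one pass: group marks by student name,
    // then compute total and average per student.
    public static Map<String, double[]> aggregate(String[] rows) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String row : rows) {
            String[] f = row.split(",");
            List<Integer> marks = grouped.computeIfAbsent(f[0], k -> new ArrayList<>());
            for (int i = 1; i < f.length; i++) marks.add(Integer.parseInt(f[i].trim()));
        }
        Map<String, double[]> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int total = 0;
            for (int m : e.getValue()) total += m;
            result.put(e.getKey(), new double[]{total, (double) total / e.getValue().size()});
        }
        return result;
    }

    public static void main(String[] args) {
        String[] rows = {"john,6,7,5,8,9", "tom,3,5,4,2,6"}; // made-up sample data
        for (Map.Entry<String, double[]> e : aggregate(rows).entrySet())
            System.out.println(e.getKey() + "\ttotal=" + (int) e.getValue()[0]
                    + " avg=" + e.getValue()[1]);
    }
}
```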
I think there may be an alternative, more efficient way to do this.
Does anyone have an idea of a better way to do these operations?
Are there any built-in functions in Hadoop which can group by student name and calculate the total marks and the average for that student?
Upvotes: 1
Views: 3957
Reputation: 5778
Use Hive. It's simpler than writing MapReduce in Java, and it might be more familiar than Pig since it has an SQL-like syntax.
https://cwiki.apache.org/confluence/display/Hive/Home
What you have to do is:
1) install the Hive client on your machine (or on one node) and point it to your cluster
2) create the table description for that file
3) load the data
4) write the SQL
Since I think your data looks like student_name, subject_mark1, subject_mark2, etc. you might need to use explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
2) CREATE TABLE students(name STRING, subject1 INT, subject2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
3) LOAD DATA INPATH '/path/to/data/students.csv' INTO TABLE students;
4) SELECT name, AVG(subject1), AVG(subject2) FROM students GROUP BY name;
output might look like:
NAME | SUBJECT1 | SUBJECT2
john | 6.2 | 7.0
tom | 3.5 | 5.0
Upvotes: 1
Reputation: 33495
I am writing MapReduce in Java, splitting all fields on the comma and sending the student name as the key and the marks as the value of the map.
In the reduce I process them, outputting the student name as the key and their marks plus the total, average, etc. as the value.
This can easily be written as a map-only job; there is no need for a reducer. Once the mapper gets a row from the CSV, split the fields and do the required calculation in the mapper itself, then emit the student name as the key and the average/total etc. as the value.
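Since each CSV row already carries all of a student's marks, the per-row computation is self-contained. A plain-Java sketch of it (in a real map-only job this logic would sit inside `Mapper.map()` and the result would go out via `context.write`; the sample row is made up):

```java
public class RowStats {
    // Compute "name -> total,avg" from one CSV row of the form
    // name,mark1,mark2,... — the whole job a map-only mapper needs to do.
    public static String mapRow(String line) {
        String[] f = line.split(",");
        int total = 0;
        for (int i = 1; i < f.length; i++) total += Integer.parseInt(f[i].trim());
        double avg = (double) total / (f.length - 1);
        return f[0] + "\ttotal=" + total + ",avg=" + avg;
    }

    public static void main(String[] args) {
        System.out.println(mapRow("john,6,7,5,8,9")); // made-up sample row
    }
}
```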
Upvotes: 0
Reputation: 25909
You can set your reducer to run as a combiner in addition to running as a reducer, so that interim calculations are performed locally before everything is sent to the reducer.
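One caveat: an average is not directly combinable, so the combiner should merge (sum, count) partials and only the reducer should divide at the end; in Hadoop you would wire the combiner up with `job.setCombinerClass(...)`. A plain-Java sketch of the idea, using a `long[]{sum, count}` pair purely for illustration:

```java
public class AvgCombine {
    // "Combiner" step: merging (sum, count) partials is associative, so it is
    // safe to do on the map side before anything reaches the reducer.
    public static long[] merge(long[] a, long[] b) {
        return new long[]{a[0] + b[0], a[1] + b[1]};
    }

    // "Reducer" step: finalize the average from the fully merged partial.
    public static double finish(long[] p) {
        return (double) p[0] / p[1];
    }

    public static void main(String[] args) {
        // Two map tasks pre-aggregate locally (marks 6,7 and mark 5),
        // then the reducer merges the partials and finishes.
        long[] fromMap1 = merge(new long[]{6, 1}, new long[]{7, 1});
        long[] fromMap2 = new long[]{5, 1};
        System.out.println(finish(merge(fromMap1, fromMap2))); // average of 6,7,5
    }
}
```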
As Nicolas78 said, you should consider looking at Pig, which does a pretty good job of building an efficient map/reduce pipeline and saves you both code and effort.
Upvotes: 0
Reputation: 5144
You might want to have a look at Pig (http://pig.apache.org/), which provides a simple language on top of Hadoop that lets you perform many standard tasks with much shorter code.
Upvotes: 2