Debugger

Reputation: 564

CSV processing in Hadoop

I have 6 fields in a CSV file:

I am writing MapReduce in Java, splitting the fields on commas and emitting the student name as the key and the marks as the value of the map.

In the reduce I process them, outputting the student name as the key and their marks plus the total, average, etc. as the value of the reduce.
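
For reference, a minimal sketch of the kind of job described (hypothetical class names; it assumes the first field is the student name and the remaining five are numeric marks, since the actual field layout isn't shown):

    // Sketch only: assumes rows of the form "name,mark1,mark2,mark3,mark4,mark5".
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StudentMarks {

        public static class MarksMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                // Emit one (name, mark) pair per subject column.
                for (int i = 1; i < fields.length; i++) {
                    context.write(new Text(fields[0].trim()),
                            new IntWritable(Integer.parseInt(fields[i].trim())));
                }
            }
        }

        public static class MarksReducer
                extends Reducer<Text, IntWritable, Text, Text> {
            @Override
            protected void reduce(Text name, Iterable<IntWritable> marks, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                int count = 0;
                for (IntWritable mark : marks) {
                    total += mark.get();
                    count++;
                }
                double average = count == 0 ? 0.0 : (double) total / count;
                context.write(name, new Text("total=" + total + ", avg=" + average));
            }
        }
    }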

I think there may be an alternative, more efficient way to do this.

Has anyone got an idea of a better way to do these operations?

Are there any built-in functions in Hadoop which can group by student name and calculate the total marks and average associated with that student?

Upvotes: 1

Views: 3957

Answers (4)

Federico

Reputation: 5778

Use Hive. It's simpler than writing MapReduce in Java and might be more familiar than Pig, since it has SQL-like syntax.

https://cwiki.apache.org/confluence/display/Hive/Home

What you have to do is: 1) install the Hive client on your machine or on one node and point it to your cluster, 2) create the table definition for that file, 3) load the data, 4) write the SQL. Since I think your data looks like student_name, subject_mark1, subject_mark2, etc., you might need to use explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode

2) CREATE TABLE students(name STRING, subject1 INT, subject2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

3) LOAD DATA INPATH '/path/to/data/students.csv' INTO TABLE students;

4) SELECT name, AVG(subject1), AVG(subject2) FROM students GROUP BY name;

The output might look like:

NAME | SUBJECT1 | SUBJECT2
john | 6.2      | 7.0
tom  | 3.5      | 5.0

Upvotes: 1

Praveen Sripati

Reputation: 33495

I am writing MapReduce in Java, splitting the fields on commas and emitting the student name as the key and the marks as the value of the map.

In the reduce I process them, outputting the student name as the key and their marks plus the total, average, etc. as the value of the reduce.

This can easily be written as a map-only job; there is no need for a reducer. Once the mapper gets a row from the CSV, split it and do the required calculations in the mapper itself, then emit the student name as the key and the average/total etc. as the value.
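
A minimal sketch of such a map-only mapper (hypothetical names, with the same assumption that the first field is the student name and the rest are numeric marks; this works because each student's marks sit on a single row). Set job.setNumReduceTasks(0) in the driver so the mapper output is written out directly:

    // Sketch only: the whole calculation is done per row, so no reducer is needed.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MapOnlyStudentMarks
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            int total = 0;
            for (int i = 1; i < fields.length; i++) {
                total += Integer.parseInt(fields[i].trim());
            }
            double average = (double) total / (fields.length - 1);
            context.write(new Text(fields[0].trim()),
                    new Text("total=" + total + ", avg=" + average));
        }
    }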

Upvotes: 0

Arnon Rotem-Gal-Oz

Reputation: 25909

You can set your reducer to run as a combiner in addition to running as a reducer, so that interim calculations are performed on the map side before everything is sent to the reducer.
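
A driver-side sketch of that wiring (hypothetical class names; SumReducer stands for a reducer that simply sums the IntWritable marks per student). Note that a reducer can only double as a combiner when the operation is associative and commutative and its output key/value types match the map output types, so totals are fine, but averages should be derived afterwards or carried as (sum, count) pairs:

    // Sketch only: reuse a summing reducer as the combiner for interim totals.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class StudentMarksDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "student marks");
            job.setJarByClass(StudentMarksDriver.class);
            job.setMapperClass(MarksMapper.class);   // emits (name, mark)
            job.setCombinerClass(SumReducer.class);  // partial per-node totals
            job.setReducerClass(SumReducer.class);   // final totals
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }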

As Nicolas78 said, you should consider looking at Pig, which does a pretty good job of building efficient map/reduce and saves you both code and effort.

Upvotes: 0

Nicolas78

Reputation: 5144

You might want to have a look at Pig (http://pig.apache.org/), which provides a simple language on top of Hadoop that lets you perform many standard tasks with much shorter code.

Upvotes: 2
