Toneo

Reputation: 1

Solving a big data issue with Hadoop

My project consists of comparing two big files (CSV files > 4 GB) that contain the same data but not the same structure (the id column might be the 1st one in file 1 and the 9th one in file 2, ...). I was thinking that I could solve this problem with a MapReduce program, but after reading a little about Pig and Hive, I'm actually confused...

Does Hive make this problem easier, and do I still need to write a MapReduce program?

Upvotes: 0

Views: 149

Answers (2)

Richard Taylor

Reputation: 780

If you're willing to look at a non-Hadoop solution, this problem is relatively trivial to code for the HPCC platform (http://hpccsystems.com/).

First you would spray the files onto the HPCC platform, then define the two files, with their separate structures, and compare them, most likely using a JOIN function. Here is some fully-functional code to demonstrate how it's done:

rec1 := RECORD  //record layout of first file
  UNSIGNED ID;
  STRING20 txt1;
  STRING20 txt2;
  STRING20 txt3;
END;    

rec2 := RECORD  //record layout of second file
  STRING20 Str2;
  STRING20 Str1;
  STRING20 Str3;
  UNSIGNED ColumnID; 
END;    

// This is the way the files would be defined using your CSV files:
// ds1 := DATASET('FirstCSVfile',rec1,CSV);
// ds2 := DATASET('SecondCSVfile',rec2,CSV);

// These inline files just demo the code:
ds1 := DATASET([{1,'Field1','Field2','Field3'},
                {2,'Field1','Field2','Field3'},
                {3,'Field1','Field2','Field3'},
                {4,'Field1','Field2','Field3'},
                {5,'Field1','Field2','Field3'}],rec1);

ds2 := DATASET([{'Field2','Field1','Field3',1},
                {'F2','Field1','Field3',2},
                {'Field2','F1','Field3',3},
                {'Field2','Field1','Field3',5}],rec2);

Rec1 CmpFields(Rec1 L, Rec2 R) := TRANSFORM
  SELF.ID := L.ID;
  SELF.txt1 := IF(L.txt1=R.Str1,L.txt1,'');
  SELF.txt2 := IF(L.txt2=R.Str2,L.txt2,'');
  SELF.txt3 := IF(L.txt3=R.Str3,L.txt3,'');
END;

Cmp := JOIN(ds1,ds2,LEFT.ID = RIGHT.ColumnID,CmpFields(LEFT,RIGHT),LEFT OUTER);

Cmp;                                  //just show the result
Cmp(txt1='' AND txt2='' AND txt3=''); // filter for only non-matches on ID
Cmp(txt1='' OR  txt2='' OR  txt3=''); // filter for all non-matching field data

This is a simple LEFT OUTER JOIN of the two files based on matching ID field values (even though they're named and positioned differently). The TRANSFORM function does the field-by-field comparison (and note, these text fields are also named differently in the two files), simply producing blanks when the field values do not match.
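For readers more comfortable outside ECL, the same join-and-compare logic can be sketched in plain Python (a hypothetical illustration, not part of the HPCC answer; the dict keys mirror the ECL record layouts above, and the inline data mirrors ds1/ds2):

```python
# Plain-Python sketch of the LEFT OUTER join-and-compare above.
# Field names (ID, txt1..txt3, Str1..Str3, ColumnID) come from the ECL
# record layouts; real use would read these rows from the two CSV files.

ds1 = [
    {'ID': 1, 'txt1': 'Field1', 'txt2': 'Field2', 'txt3': 'Field3'},
    {'ID': 2, 'txt1': 'Field1', 'txt2': 'Field2', 'txt3': 'Field3'},
    {'ID': 3, 'txt1': 'Field1', 'txt2': 'Field2', 'txt3': 'Field3'},
    {'ID': 4, 'txt1': 'Field1', 'txt2': 'Field2', 'txt3': 'Field3'},
    {'ID': 5, 'txt1': 'Field1', 'txt2': 'Field2', 'txt3': 'Field3'},
]
ds2 = [
    {'Str2': 'Field2', 'Str1': 'Field1', 'Str3': 'Field3', 'ColumnID': 1},
    {'Str2': 'F2',     'Str1': 'Field1', 'Str3': 'Field3', 'ColumnID': 2},
    {'Str2': 'Field2', 'Str1': 'F1',     'Str3': 'Field3', 'ColumnID': 3},
    {'Str2': 'Field2', 'Str1': 'Field1', 'Str3': 'Field3', 'ColumnID': 5},
]

# Index the right-hand file by its id column (named and positioned
# differently from the left-hand file's ID).
right_by_id = {r['ColumnID']: r for r in ds2}

def cmp_fields(left, right):
    """The ECL TRANSFORM: blank out any field whose values do not match."""
    r = right or {}  # LEFT OUTER: a missing right row compares as blanks
    return {
        'ID':   left['ID'],
        'txt1': left['txt1'] if left['txt1'] == r.get('Str1') else '',
        'txt2': left['txt2'] if left['txt2'] == r.get('Str2') else '',
        'txt3': left['txt3'] if left['txt3'] == r.get('Str3') else '',
    }

# The LEFT OUTER JOIN itself
cmp = [cmp_fields(l, right_by_id.get(l['ID'])) for l in ds1]

# The two filters shown in the ECL code
no_id_match  = [r for r in cmp if r['txt1'] == r['txt2'] == r['txt3'] == '']
any_mismatch = [r for r in cmp if '' in (r['txt1'], r['txt2'], r['txt3'])]
```

Note that this sketch holds one whole file in memory as a dict, which is exactly what stops working at the 4 GB scale in the question; platforms like HPCC or Hadoop exist to distribute that join.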

Upvotes: 2

user2046117

Reputation:

Hive basically puts a table structure over your data and lets you run SQL-like queries against it. If you know the structure of the data well and can create the corresponding tables, you can upload the data into HDFS and create an external table over the top.

When you run a query, Hive translates it into a MapReduce job and executes it against the data, so there is no need to write your own MapReduce program.
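To make that concrete, a hedged sketch of what the Hive side might look like for the question's two files (table names, column layouts, and HDFS paths are assumptions for illustration, not from the original answer):

```sql
-- Hypothetical layouts: file 1 has (id, a, b); file 2 has the same
-- data with the columns in a different order, (b, a, id).
CREATE EXTERNAL TABLE file1 (id BIGINT, a STRING, b STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/file1';

CREATE EXTERNAL TABLE file2 (b STRING, a STRING, id BIGINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/file2';

-- Rows whose id exists in both files but whose fields differ;
-- Hive compiles this query into a MapReduce job for you.
SELECT f1.id, f1.a AS a1, f2.a AS a2, f1.b AS b1, f2.b AS b2
FROM file1 f1 JOIN file2 f2 ON f1.id = f2.id
WHERE f1.a <> f2.a OR f1.b <> f2.b;
```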

I don't personally like Talend, but it might be worth a look for you, since it's free and does this sort of thing well. (The reason I don't like it is that you get a barrage of contact from Talend trying to sell consultancy services once you download it.)

Give Talend a go, and maybe look at Talend by Example.

Upvotes: 0
