Reputation: 2218
I have a large dataset in text files (1,000,000 lines). Each line has 128 columns.
Now I am trying to build a k-d tree from this data, and I want to use MapReduce for the calculations.
Brute-force approach for my problem:
1) Write a MapReduce job to find the variance of each column and select the column with the highest variance.
2) Taking (column name, variance value) as input, write another MapReduce job to split the data into two parts: one part has all the rows whose value in that column is less than the input value, the other part has all the rows whose value is greater.
3) For each part, repeat steps 1 and 2, and continue until you are left with at most 500 rows in each part.
The (column name, variance value) pair forms a single node of my tree, so with this brute-force approach, for a tree of height 10 I need to run about 1024 MapReduce jobs.
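To make step 1 concrete, here is a rough sketch of what I have in mind, written as a Hadoop Streaming job in Python (the tab delimiter is just an assumption about the file format). The mapper emits per-column partial sums and the reducer turns them into variances, so the driver only has to pick the column with the maximum value.

```python
#!/usr/bin/env python
# mapper.py -- step 1: emit per-column partial sums needed for the variance.
# Assumes 128 tab-separated numeric columns per line (adjust the delimiter
# to whatever the real files use).
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    for col, raw in enumerate(line.split('\t')):
        x = float(raw)
        # key = column index, value = (count, sum, sum of squares)
        print('%d\t1\t%.10f\t%.10f' % (col, x, x * x))
```

The matching reducer, keyed by column index:

```python
#!/usr/bin/env python
# reducer.py -- aggregate the partial sums per column and print its variance.
# Hadoop Streaming delivers all records for one key contiguously.
import sys

def flush(col, n, s, sq):
    if col is not None and n > 0:
        mean = s / n
        print('%s\t%.10f' % (col, sq / n - mean * mean))

cur, n, s, sq = None, 0, 0.0, 0.0
for line in sys.stdin:
    col, cnt, x, x2 = line.strip().split('\t')
    if col != cur:
        flush(cur, n, s, sq)
        cur, n, s, sq = col, 0, 0.0, 0.0
    n += int(cnt)
    s += float(x)
    sq += float(x2)
flush(cur, n, s, sq)
```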
My questions:
1) Is there any way I can improve efficiency by running fewer MapReduce jobs?
2) I am reading the same data every time. Is there any way I can avoid that?
3) Are there other frameworks like Pig, Hive, etc. that are more efficient for this kind of task?
4) Are there any frameworks with which I can save the data into a data store and easily retrieve it?
Please help.
Upvotes: 0
Views: 847
Reputation: 3619
With one MR job per node of the tree you run O(2^n) jobs (where n is the height of the tree), which is bad given YARN's per-job overhead. But with some simple programming tricks you can bring that down to O(n) jobs. Here are some ideas:
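For example (just a sketch of one possible trick, with an assumed tab-separated format and a node-id prefix added to each row): process an entire tree level in a single job by keying the variance statistics on (node id, column) instead of just the column.

```python
#!/usr/bin/env python
# level_mapper.py -- one MapReduce job per tree LEVEL instead of per node.
# Each input line is assumed to look like "<nodeId>\t<c0>\t<c1>...\t<c127>",
# where <nodeId> identifies the tree node the row currently falls into.
import sys

for line in sys.stdin:
    parts = line.strip().split('\t')
    if len(parts) < 2:
        continue
    node_id, values = parts[0], parts[1:]
    for col, raw in enumerate(values):
        x = float(raw)
        # composite key (nodeId, column) -> the reducer computes one variance
        # per column per node, i.e. for every node of the current level at once
        print('%s:%d\t1\t%.10f\t%.10f' % (node_id, col, x, x * x))
```

The reducer is the same aggregation as in step 1, just keyed by (nodeId, column). After the driver chooses a split for every node of the level, a second lightweight pass rewrites each row's node-id prefix to the id of its left or right child, so a tree of height 10 needs on the order of 10-20 jobs instead of ~1024.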
Upvotes: 1
Reputation: 3845
Why don't you try Apache Spark (https://spark.apache.org/) here? This seems like a perfect use case for Spark.
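A rough PySpark sketch of what that could look like (the input path and the tab delimiter are assumptions, and the variance value is used as the split threshold only because that is what the question describes): cache() keeps the parsed rows in memory, so the repeated passes of the recursion do not re-read the files from disk, which also addresses question 2.

```python
# kd_tree_spark.py -- a rough sketch, not a tuned implementation.
from pyspark import SparkContext

sc = SparkContext(appName="kd-tree-build")

# Path and delimiter are assumptions; cache() keeps the parsed rows in memory
# so every later pass reuses them instead of re-reading the text files.
rows = (sc.textFile("hdfs:///path/to/input")
          .map(lambda line: [float(v) for v in line.split('\t')])
          .cache())

def column_variances(rdd, dim):
    """One pass over the RDD -> per-column variance."""
    zero = (0, [0.0] * dim, [0.0] * dim)  # (count, sums, sums of squares)

    def seq(acc, row):
        n, s, sq = acc
        return (n + 1,
                [a + x for a, x in zip(s, row)],
                [a + x * x for a, x in zip(sq, row)])

    def comb(a, b):
        return (a[0] + b[0],
                [x + y for x, y in zip(a[1], b[1])],
                [x + y for x, y in zip(a[2], b[2])])

    n, s, sq = rdd.aggregate(zero, seq, comb)
    return [sq[i] / n - (s[i] / n) ** 2 for i in range(dim)]

def build(rdd, leaf_size=500):
    """Recursively split until each part has at most leaf_size rows."""
    if rdd.count() <= leaf_size:
        return {"points": rdd.collect()}
    dim = len(rdd.first())
    variances = column_variances(rdd, dim)
    col = max(range(dim), key=lambda i: variances[i])
    # The question splits on the variance value itself; swap in the column's
    # median or mean here if that is what step 2 really intends.
    threshold = variances[col]
    left = rdd.filter(lambda r: r[col] < threshold).cache()
    right = rdd.filter(lambda r: r[col] >= threshold).cache()
    node = {"column": col, "value": threshold,
            "left": build(left), "right": build(right)}
    rdd.unpersist()
    return node

tree = build(rows)
```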
Upvotes: 2