Reputation: 28971
A certain job I'm running needs to collect some metadata from a DB (MySQL, though that's not especially relevant) before processing some large HDFS files. This metadata will be added to the data in the files and passed on to the later map/combine/reduce stages.
I was wondering where the "correct" place for this query is. I need the metadata to be available when each mapper begins, but running the query in the mapper seems redundant, since every mapper would execute the same query. How can I (if at all) perform this query once and share its results across all the mappers? Is there a common way to share data between all the nodes performing a task (other than writing it to HDFS)? Thanks.
Upvotes: 1
Views: 499
Reputation: 503
I would use Sqoop for ease, if you have the Cloudera distribution. I usually program with Cascading in Java, and for DB sources I use dbmigrate as a source "tap", making databases a first-class citizen. When using primary keys with dbmigrate, the performance has been adequate.
Upvotes: 0
Reputation: 443
You can run your MySQL query in your main function and store its result in a string. Then set that variable on the Hadoop job configuration (JobConf) object; variables set in the configuration object can be accessed by all mappers.
Your main class would look like this:
JobConf conf = new JobConf(Driver.class);
String metainfo = <your metadata info goes here>; // e.g. the result of your MySQL query
conf.set("metadata", metainfo);
Then, in your Map class, you can access the metadata value as follows:
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private String sMetaInfo = "";

    @Override
    public void configure(JobConf job) {
        sMetaInfo = job.get("metadata"); // getting the metadata value from the job configuration object
    }

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        // map function; sMetaInfo is available here
    }
}
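As a side note, if you are on the newer org.apache.hadoop.mapreduce API, the equivalent hook is setup(); the class name and key/value types below are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MetadataMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String sMetaInfo;

    @Override
    protected void setup(Context context) {
        // Same idea: read the value back from the job configuration
        sMetaInfo = context.getConfiguration().get("metadata");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use sMetaInfo together with each input record
    }
}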
Upvotes: 3