Reputation: 8259
Say I'm building a UDF class called StaticLookupUDF that has to load some static data from a local file during construction.
In this case I want to make sure I'm not repeating work unnecessarily: I don't want to reload the static data on every call to the evaluate() method.
Clearly each mapper uses its own instantiation of the UDF, but does a new instance get generated for each record processed?
For example, a mapper is going to process 3 rows. Does it create a single StaticLookupUDF and call evaluate() 3 times, or does it create a new StaticLookupUDF for each record, and call evaluate only once per instance?
If the second case is true, how should I structure this instead?
I couldn't find this anywhere in the docs. I'm going to look through the code, but figured I'd ask the smart people here at the same time.
Upvotes: 5
Views: 1799
Reputation: 8259
Still not totally sure about this, but I got around it by using a static, lazily initialized value that loads the data on first use.
This way you have one instance of the static value per mapper. So if you're reading in a dataset and you have 6 map tasks, you'll read in the data 6 times. Not ideal, but better than once per record.
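Here is a minimal sketch of that lazy static approach, not the exact code I used. The file name lookup.tsv, the tab-separated key/value format, and the loadCount counter are all hypothetical and only there for illustration. A real Hive UDF would extend org.apache.hadoop.hive.ql.exec.UDF; I left the base class out so the sketch compiles without Hive on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Assumption: in a real job this would be "extends UDF"
// (org.apache.hadoop.hive.ql.exec.UDF); omitted to keep the sketch standalone.
class StaticLookupUDF {
    // Hypothetical location of the static data file (tab-separated key/value lines).
    static Path dataFile = Paths.get("lookup.tsv");

    // Illustration only: counts how many times the file was actually read.
    static int loadCount = 0;

    // Shared by every instance in the same JVM; null until first use.
    private static volatile Map<String, String> lookup;

    // Lazily load the data once per JVM using double-checked locking.
    private static Map<String, String> getLookup() {
        Map<String, String> local = lookup;
        if (local == null) {
            synchronized (StaticLookupUDF.class) {
                local = lookup;
                if (local == null) {
                    local = load(dataFile);
                    lookup = local;
                }
            }
        }
        return local;
    }

    // Read the whole file into a map; lines without a tab are skipped.
    private static Map<String, String> load(Path file) {
        Map<String, String> result = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    result.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        loadCount++;
        return result;
    }

    // Called once per record; only the first call in the JVM pays for the load.
    public String evaluate(String key) {
        if (key == null) {
            return null;
        }
        return getLookup().get(key);
    }
}
```

Because the map lives in a static field, it no longer matters whether a new StaticLookupUDF is created per record or per mapper: the file is read at most once per JVM either way.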
Upvotes: 2