Reputation: 483
I want to create a UDF for PySpark based on some Java code. The UDF's signature is similar to a regex match: the first argument comes from a DataFrame column, while the second is the same constant for every row. The problem is that, as with regex, parsing the second argument on every call is time consuming, so it should be cached. In my case the second argument is even heavier to parse than a regex: it is a DSL represented as JSON. How can I do this caching?
One idea is to maintain a static cache on the Java side: generate an ID on the driver, register my parsed JSON in the cache on each worker under that ID, then pass the ID to the UDF so it can look up the already-parsed JSON. How can I achieve this? Or maybe there are other approaches?
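A simpler variant, sketched below, avoids generating and distributing IDs altogether: key a per-process cache on the DSL string itself. Each executor's Python worker then parses a given DSL at most once and reuses the parsed form across rows. The `parse_dsl` / `matches` names and the "allowed" match rule are hypothetical stand-ins for your Java-side parser and DSL semantics:

```python
import json
from functools import lru_cache

# Per-process cache: each executor parses a given DSL string once,
# then reuses the parsed form. Keying on the JSON text itself removes
# the need to generate IDs on the driver and register them on workers.
@lru_cache(maxsize=64)
def parse_dsl(dsl_json: str):
    # Stand-in for the expensive parse (your Java DSL parser would go here)
    return json.loads(dsl_json)

def matches(value: str, dsl_json: str) -> bool:
    dsl = parse_dsl(dsl_json)               # cached after the first call
    return value in dsl.get("allowed", [])  # hypothetical match rule

# On the driver you would wrap this as a UDF, roughly:
#   from pyspark.sql.functions import udf, lit, col
#   from pyspark.sql.types import BooleanType
#   match_udf = udf(matches, BooleanType())
#   df.select(match_udf(col("name"), lit(dsl_json)))
```

Since `lru_cache` lives in the Python worker process, the cache is rebuilt per executor (and per worker process), which is usually acceptable because the parse then happens once per process rather than once per row.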
Upvotes: 0
Views: 36