Christophe

Reputation: 2012

Quantopian / Zipline: weird pattern in Pipeline package

I recently found a very strange pattern in the "Pipeline" API from Quantopian/Zipline: they have a CustomFactor class, in which you will find a compute() method to be overridden when implementing your own factor model.

The signature of compute() is: def compute(self, today, assets, out, *inputs), with the following comment for parameter "out":

Output array of the same shape as assets. compute should write its desired return values into out.
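For context, a complete factor following that pattern looks like this (LatestClose is my own example, but inputs, window_length, and the compute() signature come from the API):

    from zipline.pipeline import CustomFactor
    from zipline.pipeline.data import USEquityPricing

    class LatestClose(CustomFactor):
        # Request a 2-day window of close prices for every asset.
        inputs = [USEquityPricing.close]
        window_length = 2

        def compute(self, today, assets, out, close):
            # Write the latest row of the window into `out`
            # instead of returning it.
            out[:] = close[-1]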

When I asked why the function could not simply return an output array instead of writing into an input parameter, I received the following answer:

"If the API required that the output array be returned by compute(), we'd end up doing a copy of the array into the actual output buffer which means an extra copy would get made unnecessarily."

I fail to understand why they end up doing so... Obviously there are no pass-by-value issues in Python, so there is no risk of unnecessarily copying data. This is really painful, because this is the kind of implementation they recommend people write:

    def compute(self, today, assets, out, data):
        out[:] = data[-1]

So my question is, why could it not simply be:

    def compute(self, today, assets, data):
        return data[-1]

Upvotes: 1

Views: 314

Answers (1)

Scott Sanderson

Reputation: 36

(I designed and implemented the API in question here.)

You're right that Python objects aren't copied when passed into and out of functions. The reason there's a difference between returning a row out of your CustomFactor and writing values into a provided array has to do with copies that would be made in the code that's calling your CustomFactor compute method.
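A quick NumPy illustration of that distinction (my example, not Zipline code): basic indexing into a 2D array produces a view, so writes through the view land directly in the parent array with no copy:

import numpy as np

output = np.zeros((2, 3))
row = output[0]           # basic indexing: a view, not a copy
row[:] = [1.0, 2.0, 3.0]  # writes land directly in `output`
assert output[0, 1] == 2.0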

When the CustomFactor API was originally designed, the code that calls your compute method looked roughly like this:

def _compute(self, windows, dates, assets):
    # `windows` here is a list of iterators yielding 2D slices
    # of the user's requested inputs.

    # `dates` and `assets` are row/column labels for the final output.

    # Allocate a (dates x assets) output array.
    # Each invocation of the user's `compute` function
    # corresponds to one row of output.
    output = allocate_output()

    for i in range(len(dates)):

        # Grab the next set of input arrays.
        inputs = [next(w) for w in windows]

        # Call the user's compute, which is responsible for writing
        # values into `out`.
        self.compute(
            dates[i], 
            assets,
            # This index is a non-copying operation.
            # It creates a view into row `i` of `output`.
            output[i],
            *inputs  # Unpack all the inputs.
        )

    return output

The basic idea here is that we've pre-fetched a sizeable amount of data, and we're now going to loop over windows into that data, call the user's compute function on the data, and write the result into a pre-allocated output array, which is then passed along to further transformations.
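To make "loop over windows" concrete, here's a hypothetical sketch of such an iterator (the names are mine, not the real engine code); the key point is that each yielded window is a non-copying view into the prefetched block:

import numpy as np

prefetched = np.random.rand(300, 8000)  # (dates + lookback) x assets

def iter_windows(data, window_length, n_outputs):
    for i in range(n_outputs):
        # Slicing a 2D array yields a view, not a copy.
        yield data[i:i + window_length]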

No matter what we do, we have to pay the cost of at least one copy to get the result of the user's compute function into the output array.

The most obvious API, as you point out, is to have the user simply return the output row, in which case the calling code would look like:

# Get the result row from the user.
result_row = self.compute(dates[i], assets, *inputs)
# Copy the user's result into our output buffer.
output[i] = result_row

If that were the API, then we'd be locked into paying at least the following costs for each invocation of the user's compute:

  1. Allocating the ~64,000-byte array that the user will return.
  2. A copy of the user's computed data into the user's output array.
  3. A copy from the user's output array into our own, larger array.

With the existing API, we avoid costs (1) and (3).
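To illustrate that accounting with plain NumPy (the sizes are invented; 8000 float64 columns is roughly the 64,000 bytes per row mentioned above):

import numpy as np

output = np.empty((252, 8000))   # (dates x assets) output buffer
data = np.random.rand(30, 8000)  # one prefetched input window

# Existing API: a reduction can write straight into the provided
# view -- no intermediate allocation, no extra copy.
np.mean(data, axis=0, out=output[0])

# Return-based API: the expression allocates a fresh row (cost 1),
# fills it (cost 2), and the engine copies it into the buffer (cost 3).
output[1] = data.mean(axis=0)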

With all that said, we've since made changes to how CustomFactors work that make some of the above optimizations less useful. In particular, we now only pass data to compute for assets that weren't masked out on that day, which requires a partial copy of the output array before and after the call to compute.
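Roughly, and purely as an illustration of that masking step (the variable names are mine, not the engine's):

import numpy as np

mask = np.array([True, False, True, True])  # assets passing the screen
output_row = np.full(4, np.nan)

# Partial copy before the call: gather the unmasked columns into a
# smaller array for `compute` to fill.
masked_out = np.empty(mask.sum())
masked_out[:] = 1.0  # stand-in for the user's compute()

# Partial copy after the call: write the results back into the
# full-width output row.
output_row[mask] = masked_out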

There are still some design reasons to prefer the existing API though. In particular, leaving the engine in control of the output allocation makes it easier for us to do things like pass recarrays for multi-output factors.
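For example, here's a sketch of the idea (not the actual engine code): a structured array lets each named output be a view into one shared buffer, which is straightforward when the engine owns the allocation:

import numpy as np

# One buffer holding two named outputs per (date, asset) cell.
output = np.recarray((252, 8000), dtype=[('alpha', 'f8'), ('beta', 'f8')])

# Field access returns views, so a multi-output compute can write
# into output.alpha and output.beta without any copying.
output.alpha[0] = 1.0
output.beta[0] = 2.0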

Upvotes: 2
