How does this class work? (Related to Quantopian, Python and Pandas)

Question

From here: https://www.quantopian.com/posts/wsj-example-algorithm

class Reversion(CustomFactor):
    """
    Here we define a basic mean reversion factor using a CustomFactor. We
    take a ratio of the last close price to the average price over the
    last 60 days. A high ratio indicates a high price relative to the mean
    and a low ratio indicates a low price relative to the mean.
    """
    inputs = [USEquityPricing.close]
    window_length = 60   

    def compute(self, today, assets, out, prices):
        out[:] = -prices[-1] / np.mean(prices, axis=0)

Reversion() seems to return a pandas.DataFrame, and I have absolutely no idea why. For one thing, where are inputs and window_length used? And what exactly is out[:]?

Is this specific behavior related to Quantopian in particular or Python/Pandas?

ssanderson · Accepted Answer

TL;DR

Reversion() doesn't return a DataFrame, it returns an instance of the Reversion class, which you can think of as a formula for performing a trailing window computation. You can run that formula over a particular time period using either quantopian.algorithm.pipeline_output or quantopian.research.run_pipeline, depending on whether you're writing a trading algorithm or doing offline research in a notebook.
The compute method is what defines the "formula" computed by a Reversion instance. It calculates a reduction over a 2D numpy array of prices, where each row of the array corresponds to a day and each column of the array corresponds to a stock. The result of that computation is a 1D array containing a value for each stock, which is copied into out. out is also a numpy array. The syntax out[:] = <expression> says "copy the values from <expression> into out".

compute writes its result directly into an output array instead of simply returning it because doing so allows the CustomFactor base class to ensure that the output has the correct shape and dtype, which can be nontrivial for more complex cases.

Having a function "return" by overwriting an input is unusual and generally non-idiomatic Python. I wouldn't recommend implementing a similar API unless you're very sure that there isn't a better solution.
All of the code in the linked example is open source and can be found in Zipline, the framework on top of which Quantopian is built. If you're interested in the implementation, the following files are good places to start:
- zipline/pipeline/engine.py
- zipline/pipeline/term.py
- zipline/pipeline/graph.py
- zipline/pipeline/pipeline.py
- zipline/pipeline/factors/factor.py
You can also find a detailed tutorial on the Pipeline API here.

I think there are two kinds of answers to your question:

1. How does the Reversion class fit into the larger framework of a Zipline/Quantopian algorithm? In other words, "How is the Reversion class used?"
2. What are the expected inputs to Reversion.compute() and what computation does it perform on those inputs? In other words, "What, concretely, does the Reversion.compute() method do?"

It's easier to answer (2) with some context from (1).

How is the `Reversion` class used?

Reversion is a subclass of CustomFactor, which is part of Zipline's Pipeline API. The primary purpose of the Pipeline API is to make it easy for users to perform a certain special kind of computation efficiently over many sources of data. That special kind of computation is a cross-sectional trailing-window computation, which has the form:

Every day, for some set of data sources, fetch the last N days of data for all known assets and apply a reduction function to produce a single value per asset.

A very simple cross-sectional trailing-window computation would be something like "close-to-close daily returns", which has the form:

Every day, fetch the last two days of close prices and, for each asset, calculate the percent change between the asset's previous day close price and its current close price.
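
As a rough illustration of that reduction, here is a minimal numpy-only sketch. The price values and the variable names (closes, daily_returns) are made up for the example and are not part of the Quantopian API; the point is only that a 2 x N window of closes reduces to one value per asset.

```python
import numpy as np

# Hypothetical data: 2 days (rows) x 3 assets (columns) of close prices.
closes = np.array([
    [10.0, 50.0, 200.0],  # previous day's close
    [11.0, 45.0, 200.0],  # current day's close
])

# Reduce the 2 x 3 window to a single percent change per asset.
daily_returns = (closes[-1] - closes[0]) / closes[0]

print(daily_returns.shape)  # one value per column (asset)
print(daily_returns)
```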

To describe a cross-sectional trailing-window computation, we need at least three pieces of information:

1. On what kinds of data (e.g. price, volume, market cap) does the computation operate?
2. On how long of a trailing window of data (e.g. 1 day, 20 days, 100 days) does the computation operate?
3. What reduction function does the computation perform over the data described by (1) and (2)?

The CustomFactor class defines an API for consolidating these three pieces of information into a single object.

The inputs attribute describes the set of inputs needed to perform a computation. In the snippet from the question, the only input is USEquityPricing.close, which says that we just need trailing daily close prices. In general, however, we can ask for any number of inputs. For example, to compute VWAP (Volume-Weighted Average Price), we would use something like inputs = [USEquityPricing.close, USEquityPricing.volume] to say that we want trailing close prices and trailing daily volumes.
The window_length attribute describes the number of days of trailing data required to perform a computation. In the snippet above we're requesting 60 days of trailing close prices.
The compute method describes the trailing-window computation to be performed. In the section below, I've outlined exactly how compute performs its computation. For now, it's enough to know that compute is essentially a reduction function from some number of 2-dimensional arrays to a single 1-dimensional array.
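
To make "a reduction from some number of 2-dimensional arrays to a single 1-dimensional array" more concrete, here is a small numpy-only sketch of a VWAP-style reduction over two inputs. The function name vwap and the sample numbers are assumptions for illustration; this is not Zipline's built-in VWAP factor, just the same shape of computation on daily closes and volumes.

```python
import numpy as np

def vwap(closes, volumes):
    """Reduce two (window_length, n_assets) arrays to one value per asset."""
    return np.sum(closes * volumes, axis=0) / np.sum(volumes, axis=0)

# Hypothetical window: 3 days (rows) x 2 assets (columns).
closes = np.array([[10.0, 100.0],
                   [12.0, 100.0],
                   [14.0, 130.0]])
volumes = np.array([[100.0, 10.0],
                    [100.0, 10.0],
                    [200.0, 20.0]])

result = vwap(closes, volumes)
print(result)  # one volume-weighted price per asset
```

Each entry of inputs would correspond to one 2-D array argument here, and window_length would correspond to the number of rows.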

You might notice that we haven't defined an actual set of dates on which we might want to compute a Reversion factor. This is by design, since we'd like to be able to use the same Reversion instance to perform calculations at different points in time.

Quantopian defines two APIs for computing expressions like Reversion: an "online" mode designed for use in actual trading algorithms, and a "batch" mode designed for use in research and development. In both APIs, we first construct a Pipeline object that holds all the computations we want to perform. We then feed our pipeline object into a function that actually performs the computations we're interested in.

In the batch API, we call run_pipeline passing our pipeline, a start date, and an end date. A simple research notebook computing a custom factor might look like this:

from quantopian.pipeline import Pipeline, CustomFactor
from quantopian.research import run_pipeline

class Reversion(CustomFactor):
    # Code from snippet above.
    ...

reversion = Reversion()
pipeline = Pipeline({'reversion': reversion})
result = run_pipeline(pipeline, start_date='2014-01-02', end_date='2015-01-02')
do_stuff_with(result)

In a trading algorithm, we're generally interested in the most recently computed values from our pipeline, so there's a slightly different API: we "attach" a pipeline to our algorithm on startup, and we request the latest output from the pipeline at the start of each day. A simple trading algorithm using Reversion might look something like this:

import quantopian.algorithm as algo
from quantopian.pipeline import Pipeline, CustomFactor


class Reversion(CustomFactor):
    # Code from snippet above.
    ...

def initialize(context):
    reversion = Reversion()
    pipeline = Pipeline({'reversion': reversion})
    algo.attach_pipeline(pipeline, name='my_pipe')

def before_trading_start(context, data):
    result = algo.pipeline_output(name='my_pipe')
    do_stuff_with(result)

The most important thing to understand about the two examples above is that simply constructing an instance of Reversion doesn't perform any computation. In particular, the line:

reversion = Reversion()

doesn't fetch any data or call the compute method. It simply creates an instance of the Reversion class, which knows that it needs 60 days of close prices each day to run its compute function. Similarly, USEquityPricing.close isn't a DataFrame or a numpy array or anything like that: it's just a sentinel value that describes what kind of data Reversion needs as an input.

One way to think about this is by an analogy to mathematics. An instance of Reversion is like a formula for performing a calculation, and USEquityPricing.close is like a variable in that formula.

Simply writing down the formula doesn't produce any values; it just gives us a way to say "here's how to compute a result if you plug in values for all of these variables".

We get a concrete result by actually plugging in values for our variables, which happens when we call run_pipeline or pipeline_output.
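
The "formula versus values" distinction can be mimicked in plain Python. The ToyFactor class below is a hypothetical stand-in, not a Zipline class; it only shows that constructing the object runs no computation, and that a result appears only once data is supplied.

```python
import numpy as np

class ToyFactor:
    """Hypothetical stand-in for a factor: it stores a recipe, nothing more."""
    window_length = 3

    def __init__(self):
        self.compute_calls = 0  # counts how often compute has actually run

    def compute(self, prices):
        self.compute_calls += 1
        return prices[-1] / np.mean(prices, axis=0)

factor = ToyFactor()                             # "writing down the formula"
calls_after_construction = factor.compute_calls  # still zero at this point

# Hypothetical 3 days x 2 assets of prices.
prices = np.array([[1.0, 4.0],
                   [2.0, 4.0],
                   [3.0, 4.0]])
values = factor.compute(prices)                  # "plugging in values"
print(calls_after_construction, factor.compute_calls, values)
```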

So what, concretely, does `Reversion.compute()` do?

Both run_pipeline and pipeline_output ultimately boil down to calls to PipelineEngine.run_pipeline, which is where actual computation happens.

To continue the analogy from above, if reversion is a formula, and USEquityPricing.close is a variable in that formula, then PipelineEngine is the grade school student whose homework assignment is to look up the value of the variable and plug it into the formula.

When we call PipelineEngine.run_pipeline(pipeline, start_date, end_date), the engine iterates through our requested expressions, loads the inputs for those expressions, and then calls each expression's compute method once per trading day between start_date and end_date with appropriate slices of the loaded input data.
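
The loop described above can be approximated with a toy, numpy-only driver. Everything here (ToyReversion, toy_run, integer day indices instead of Timestamps) is an assumption made for illustration; the real engine's handling of calendars, window alignment, and missing data is more involved.

```python
import numpy as np

class ToyReversion:
    """Hypothetical factor with the same compute signature as in the question."""
    window_length = 3

    def compute(self, today, assets, out, prices):
        out[:] = -prices[-1] / np.mean(prices, axis=0)

def toy_run(factor, all_prices, assets):
    """Call factor.compute once per day, passing a trailing window of rows."""
    w = factor.window_length
    results = {}
    for today in range(w - 1, all_prices.shape[0]):
        window = all_prices[today - w + 1: today + 1]  # shape (w, n_assets)
        out = np.full(len(assets), np.nan)             # driver allocates out
        factor.compute(today, assets, out, window)
        results[today] = out
    return results

# Hypothetical data: 4 days (rows) x 2 assets (columns).
all_prices = np.array([[1.0, 10.0],
                       [2.0, 10.0],
                       [3.0, 10.0],
                       [4.0, 10.0]])
assets = np.array([0, 1])
results = toy_run(ToyReversion(), all_prices, assets)
for day, values in results.items():
    print(day, values)
```

Note that the driver, not compute, allocates out; compute only fills it in.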

Concretely, the engine expects that each expression has a compute method with a signature like:

def compute(self, today, assets, out, input1, input2, ..., inputN):

The first four arguments are always the same:

self is the CustomFactor instance in question (e.g. reversion in the snippets above). This is how methods work in Python in general.
today is a pandas Timestamp representing the day on which compute is being called.
assets is a 1-dimensional numpy array containing an integer identifier for every asset that is tradeable on today.
out is a 1-dimensional numpy array of the same shape as assets. The contract of compute is that it should write the result of its computation into out.

The remaining parameters are 2-D numpy arrays with shape (window_length, len(assets)). Each of these parameters corresponds to an entry in the expression's inputs list. In the case of Reversion, we only have a single input, USEquityPricing.close, so there's only one extra parameter, prices, which is a 60 x len(assets) array containing 60 days of trailing close prices for every asset that existed on today.
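
To see the shapes involved, here is a numpy-only sketch that applies the expression from the question to a randomly generated prices array. The data is synthetic and the four-asset universe is an assumption; only the shapes and the expression mirror the original snippet.

```python
import numpy as np

window_length, n_assets = 60, 4  # 4 assets is an arbitrary choice
rng = np.random.default_rng(0)

# Synthetic random-walk prices: one row per day, one column per asset.
prices = 100.0 + rng.normal(size=(window_length, n_assets)).cumsum(axis=0)

last_close = prices[-1]                # most recent row, shape (n_assets,)
window_mean = np.mean(prices, axis=0)  # mean down each column, shape (n_assets,)

out = np.empty(n_assets)
out[:] = -last_close / window_mean     # same expression as Reversion.compute

print(prices.shape, last_close.shape, window_mean.shape, out.shape)
```

axis=0 is what collapses the 60 rows (days) so that a single value per column (asset) remains.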

One unusual feature of compute is that it's expected to write its computed results into out. Having functions "return" by mutating inputs is common in low level languages like C or Fortran, but it's rare in Python and generally considered non-idiomatic. compute writes its outputs into out partly for performance reasons (we can avoid extra copying of large arrays in some cases), and partly to make it so that CustomFactor implementors don't need to worry about constructing output arrays with correct shapes and dtypes, which can be tricky in more complex cases where a user has more than one return value.
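
The difference between out[:] = ... and out = ... is plain numpy/Python behavior and can be checked outside Quantopian. In the sketch below the two function names are made up for the comparison: slice assignment writes into the caller's array, while plain assignment only rebinds the local name and leaves the caller's array untouched.

```python
import numpy as np

def compute_in_place(out, prices):
    # Slice assignment copies the values into the array the caller passed in.
    out[:] = prices[-1] / np.mean(prices, axis=0)

def compute_rebinding(out, prices):
    # Plain assignment rebinds the local name; the caller's array is unchanged.
    out = prices[-1] / np.mean(prices, axis=0)

# Hypothetical 2 days x 2 assets of prices.
prices = np.array([[1.0, 2.0],
                   [3.0, 2.0]])

a = np.zeros(2)
compute_in_place(a, prices)

b = np.zeros(2)
compute_rebinding(b, prices)

print(a)  # filled with the computed ratios
print(b)  # still all zeros
```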
