Notebook

In this notebook, we implement nine common asset-filtering criteria as Pipeline API filters and analyze their outputs.

We build visualizations of our filter results using Pandas and Seaborn.

Several of our filters take advantage of Pipeline's recently-added support for string columns.

Analyzing a Trading Universe with Pipeline and String Data

Many publicly-traded assets are not well-suited to algorithmic trading.

There are a variety of reasons why an algorithm might not want to trade a particular asset: the algorithm may not have a complete set of data available for the asset; the asset may be an unusual instrument not easily purchased by the general public; or the asset may be an instrument like a Depository Receipt that carries undesired risks.

Since many equities algorithms only want to consider a subset of all theoretically-available equities, a common first step for many algorithms is to define a top-level "trading universe" containing just the equities that the algorithm author deems acceptable to trade.

Dollar volume and market cap filters have been possible since the initial Pipeline API release, and the addition of built-ins like AverageDollarVolume and IsPrimaryShare made it easier to implement common filters. Until recently, however, many important filtering criteria have been hard or impossible to implement because pipelines didn't have support for string-based data.

As of today, Quantopian supports loading string-typed columns from the morningstar fundamentals dataset. Expressions producing strings yield instances of the recently-released Classifier type, and there are several new methods on Classifier that can be used to produce Filters.

In [1]:
from itertools import product

import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import morningstar as mstar
from quantopian.pipeline.factors import AverageDollarVolume
from quantopian.pipeline.filters.morningstar import IsPrimaryShare
from quantopian.research import run_pipeline
In [2]:
# This is the value used by Morningstar for common stock in the 
# share_class_reference.security_type column.
COMMON_STOCK= 'ST00000001'


def make_filters(volume_threshold):
    """
    Create a Pipeline producing Filters implementing common acceptance criteria.
    """

    # Equities with an average daily volume greater than the supplied threshold.
    high_volume = (AverageDollarVolume(window_length=21) > volume_threshold)
    
    # Equities that morningstar lists as primary shares.
    # NOTE: This will return False for stocks not in the morningstar database.
    primary_share = IsPrimaryShare()
    
    # Equities for which morningstar's most recent Market Cap value is non-null.
    # This is a good proxy for determining whether we have fundamental data for the stock.
    have_market_cap = mstar.valuation.market_cap.latest.notnull()
    
    # Equities not listed as depository receipts by morningstar.
    # Note the inversion operator, `~`, at the start of the expression.
    not_depository = ~mstar.share_class_reference.is_depositary_receipt.latest
    
    # Equities that listed as common stock (as opposed to, say, preferred stock).
    #
    # This is our first string column. The .eq method used here produces a Filter returning
    # True for all asset/date pairs where security_type produced a value of 'ST00000001'.
    common_stock = mstar.share_class_reference.security_type.latest.eq(COMMON_STOCK)
    
    # Equities whose exchange id does not start with OTC (Over The Counter).
    # startswith() is a new method available only on string-dtype Classifiers.
    # It returns a Filter.
    not_otc = ~mstar.share_class_reference.exchange_id.latest.startswith('OTC')
    
    # Equities whose symbol (according to morningstar) ends with .WI
    # This generally indicates a "When Issued" offering.
    # endswith() works similarly to startswith().
    not_wi = ~mstar.share_class_reference.symbol.latest.endswith('.WI')
    
    # Equities whose company name ends with 'LP' or a similar string.
    # The .matches() method uses the standard library `re` module to match
    # against a regular expression.
    not_lp_name = ~mstar.company_reference.standard_name.latest.matches('.* L[\\. ]?P\.?$')
    
    # Equities with a null entry for the balance_sheet.limited_partnership field.
    # This is an alternative way of checking for LPs.
    not_lp_balance_sheet = mstar.balance_sheet.limited_partnership.latest.isnull()
    
    # Create a Pipeline computing all of our filters.
    p = Pipeline(
        columns={
            'High Volume': high_volume,
            'Primary Share': primary_share,
            'Known Market Cap': have_market_cap,
            'Not Depository Receipt': not_depository,
            'Common Stock': common_stock,
            'Not OTC': not_otc,
            'Not When Issued': not_wi,
            'Not LP (Company Name)': not_lp_name,
            'Not LP (Balance Sheet)': not_lp_balance_sheet,
        }
    )
    return p


def compute_filters(start, end, volume_threshold):
    """
    Compute the filters defined by make_filters from start to end
    """
    pipe = make_filters(volume_threshold)label:github 
    results = run_pipeline(pipe, start, end)
    
    # Sort by decreasing number of stocks eliminated.
    counts = results.sum().sort(inplace=False)
    return results.loc[:, counts.index]
In [3]:
results = compute_filters('2012-07', '2012-08', volume_threshold=750000)
print "Total Asset Count: %d" % len(results)
results.head()
Total Asset Count: 175239
Out[3]:
High Volume Primary Share Known Market Cap Common Stock Not Depository Receipt Not OTC Not LP (Company Name) Not LP (Balance Sheet) Not When Issued
2012-07-02 00:00:00+00:00 Equity(2 [AA]) True True True True True True True True True
Equity(21 [AAME]) False True True True True True True True True
Equity(24 [AAPL]) True True True True True True True True True
Equity(25 [AA_PR]) False False False False True True True True True
Equity(31 [ABAX]) True True True True True True True True True

How much are we filtering out?

The first question we might be interested in is: "What percentage of all known stocks are filtered out by each of our criteria?"

A simple bar chart is a good way to show relative magnitudes.

In [4]:
def plot_filter_counts(results, axis=None):
    """
    Plot the number of stocks filtered out by each column of `results`.
    """
    nrows = len(results)
    plot = ((~results).sum().sort(inplace=False, ascending=True) / nrows).plot(
        kind='barh',
        ax=axis, 
        grid=False,
    )

    plot.tick_params(axis='both', labelsize='x-large')
    plot.set_title("Percentage of Stocks Filtered by Criterion", fontsize='x-large')
    return plot

plot_filter_counts(results);

Conclusions:

  • Just shy of half of all equities are filtered out by a 750,000 daily dollar volume threshold in 2012.
  • Checking for primary share, known market cap, and common stock all filter out 40-45% of all equities. The surprisingly-similar filtration rate here is in part due to fact that these filters will all return False for equities that are in the Quantopian symbol database, but for which morningstar doesn't provide data. (IsPrimaryShare defaults to False for NULL values in the morningstar database, and the string array returned from share_class_reference.security_type will contain None for NULL values.)
  • About 5% of companies in the Quantopian database on the date in question are depository receipts.
  • There is only one ticker (LLL.WI) that looks like a 'When Issued' stock on the morningstar database for our query date.

How much do our filters overlap?

Another question we might care about is: "To what extent are our various criteria redundant with one another?" If there's too much overlap, then we're wasting time computing filters that will all exclude the same junk. If there's not enough overlap, then we'll filter out the entire universe and have nothing left to trade.

Since overlap is a pairwise relation, this is a good use-case for a heatmap. Let's create a heatmap showing, for each filter pair (A, B), the percentage of assets filtered out by A that were also filtered out by B:

In [5]:
def plot_redundancy_heatmap(results, axis=None):
    """
    Generate a heatmap showing the redundancy percentage between pairs of columns.

    This shows, for each filter pair (f_A, f_b), the percentage of stocks filtered out by 
    f_A that were also filtered out by f_B.
    """
    # This is clearer to implement in terms of what values were filtered **out**,
    # so flip the polarity of the computed values.
    results = ~results
    
    # Calculate ratios only on stocks that were filtered by at least one criterion
    results = results.loc[results.any(axis='columns')]
    
    ncols = len(results.columns)
    percent_redundant = np.zeros((ncols, ncols), dtype=float)
    for i, j in product(range(ncols), range(ncols)):

        # Get entries in j for rows filtered out by i.
        col = results.iloc[:, j][results.iloc[:, i]]
        if not len(col):
            # If i didn't filter anything, it's 100% redundant with everything.
            percent_redundant[i, j] = 1.0
        else:
            # Calculate the redundant percentage.
            percent_redundant[i, j] = col.sum() / float(len(col))
    
    # Plot our percentages as a heatmap with Seaborn.
    plot = sns.heatmap(
        pd.DataFrame(
            data=percent_redundant, 
            index=results.columns, 
            columns=results.columns,
        ),
        fmt='.4f',
        annot=True,
        ax=axis,
        linecolor='white',
        linewidths=2,
    )
    plot.set_title("Percentage of Stocks Redundantly Filtered", fontsize='x-large')
    plot.set_xticklabels(list(results.columns), rotation=30)

plot_redundancy_heatmap(results);

Observations:

  • Our 'High Volume' filter has relatively low redundancy with the rest of our filters.
  • 'Primary Share', 'Known Market Cap', and 'Common Stock' all have very high relative redundancy. This is again due to the fact that these filters will all be False for assets without morningstar data.
  • Assets that are traded over the counter have a high likelihood (~80%) of having low trading volume. This is intuitively pretty plausible, and it lends some credence to our analysis.

Choosing a minimal subset

It's clear from our heatmap that there's a significant amount of overlap between our various filters. Consequently, a natural question that arises is "Can any of these filters be removed without changing the set of assets removed?"

We can removed unnecessary filters by computing a "covering set" on the columns of our frame with the following greedy algorithm:

Begin with an empty covering set.
While there are still rows rejected by at least one criterion:
    Choose the column rejecting the largest number of remaining rows.
    Add that column to the covering set.
    Remove all rows rejected by the new column.

Note that this isn't necessarily guaranteed to find us the smallest possible covering set (doing so would be a substantial challenge, but it's a good heuristic for doing so).

In [6]:
def covering_set(filter_results):
    """
    Recursively choose the column in `feature_sets` with the most False values,
    then remove all rows from the dataset for which that column was False.
    
    Repeat until no True values remain.
    
    Returns a list of (name, count) pairs showing the columns chosen and the 
    number of rows eliminated when the choice was made.
    """
    # It's again easier to write this in terms of what's filtered out.
    rejected = ~filter_results

    columns = []
    
    # While there are still values to reject:
    while len(rejected) and rejected.values.any():
        # Choose the column with the most rejections.
        nrejected = rejected.sum()
        new_column = nrejected.argmax()
        new_column_amount = nrejected[new_column]
        columns.append((new_column, new_column_amount))
        
        # Remove any rows rejected by that column.
        rejected = rejected.loc[~rejected[new_column]]
    return columns
In [7]:
covering_set(results)
Out[7]:
[('High Volume', 87247),
 ('Primary Share', 29921),
 ('Not LP (Company Name)', 1424),
 ('Known Market Cap', 1377),
 ('Not Depository Receipt', 1065),
 ('Not OTC', 445),
 ('Not LP (Balance Sheet)', 145)]

Observations:

  • Only 7 of our 9 filters are actually needed to produce an equivalent universe.
  • The vast majority of removals happen from High Volume and Primary Share.
  • There's still a long-ish tail of assets that require more narrowly-targeted filters.

Review

In this notebook, we implemented and analyzed methods for constructing a tradeable universe of assets.

  • We constructed filters using Pipeline's newly-added support for string columns.
  • We used Pandas and Seaborn to generate bar charts and heatmaps for visualizing filter outputs.
  • We implemented a simple greedy algorithm for removing redundant components from a suite of filters.
In [ ]: