Notebook

Factor Tearsheet

When implementing a factor in a trading algorithm, the complexity and wide range of parameters that come with basket selection and trading logic hinder our ability to evaluate the value factor's alpha signal in isolation. Before we proceed to the implementation of an algorithm, we want to know if the factor has any predictive value.

In this analysis, we'll measure a factor's predictive value using the Spearman rank correlation between the factor value and various N day forward price movement windows over a large universe of stocks. This correlation is called the Information Coefficient (IC). This tear sheet takes a pipeline factor and attempt to answer the following questions, in order:

  • What is the sector-neutral rolling mean IC for our different forward price windows?
  • What are the sector-neutral factor decile mean returns for our different forward price windows?
  • How much are the contents of the top and bottom quintile changing each day?
  • What is the autocorrelation in sector-wise factor rankings?
  • What is IC decay (difference in IC for different forward price windows) for each sector?
  • What is the IC decay for each sector over time?
  • What are the factor quintile returns for each sector?

For more information on Spearman Rank correlation, check out this notebook from the Quantopian lecture series.

In the plots that are not disagregated by sector, sector adjustment has been applied to forward price movements. You can think of this sector adjustment as incorperating the assumption of a sector-netural portfolio constraint. If we are equally weighted in each sector, we'd want our factor to help us compare stocks within their own sectors. For example, if AAPL 5-day forward return is 0.1% and the mean 5-day forward return for the Technology stocks in our universe was 0.5% in the same period, the sector adjusted 5 day return for AAPL in this period is -0.4%.

The autocorrelation and decile turnover figures are meant to be used as a measure of factor horizon. It is worth noting that these stats are potentially misleading, as our top X liquidity constraint makes our universe dynamic. This dynamic universe likely contributes to a higher quantile turnover and lower rank autocorrelation than we would see in a static universe.

In [12]:
from __future__ import division
from quantopian.pipeline import Pipeline
from quantopian.pipeline import CustomFactor
from quantopian.research import run_pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import Latest, Returns, SimpleMovingAverage
import math
import datetime
import numpy as np
import pandas as pd
import scipy as sp
import pyfolio as pf
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

SECTOR_NAMES = {
 101: 'Basic Materials',
 102: 'Consumer Cyclical',
 103: 'Financial Services',
 104: 'Real Estate',
 205: 'Consumer Defensive',
 206: 'Healthcare',
 207: 'Utilities',
 308: 'Communication Services',
 309: 'Energy',
 310: 'Industrials',
 311: 'Technology' ,
}
In [13]:
class SidFactor(CustomFactor):
    """
    Workaround to screen by sids in pipeline
    """
    inputs = []  
    window_length = 1  
    sids = []  # list, tuple and whatever np.asarray accept

    def compute(self, today, assets, out):
        out[:] = np.in1d(assets, self.sids)

def create_sid_screen(sids):
    sid_factor = SidFactor()
    sid_factor.sids = sids 
    return sid_factor.eq(True)

class ReturnsMarketExcess(CustomFactor):
    """
    Calculates the percent change in close price (market adjusted) over the given window_length.
    **Default Inputs**: [USEquityPricing.close]
    """
    params = ('market_sid',)        
    inputs = [USEquityPricing.close]

    def compute(self, today, assets, out, close, market_sid):
        returns = (close[-1] - close[0]) / close[0]
        market_idx = assets.get_loc(market_sid)
        returns -= returns[market_idx] # remove market returns
        out[:] = returns
        
def _beta(stock_prices, bench_prices):
    # `linregress` returns its results in the following order:
    # slope, intercept, r-value, p-value, stderr
    regr_results = stats.linregress(y=bench_prices, x=stock_prices)
    
    #alpha = regr_results[1]
    beta = regr_results[0]
    #r_value = regr_results[2]
    p_value = regr_results[3]
    #stderr = regr_results[4]  
    
    # Check null hypothesis
    if p_value > 0.05:
        beta = 0.        
    return beta
        
class ReturnsBetaExcess(CustomFactor):
    """
    Calculates the percent change in close price (beta adjusted) over the given window_length.
    **Default Inputs**: [USEquityPricing.close]
    """
    params = ('delta_days', 'market_sid',)        
    inputs = [USEquityPricing.close]
    window_length = 60
    
    def compute(self, today, assets, out, close, delta_days, market_sid):
        returns = (close[delta_days:] - close[:-delta_days]) # absolute returns
        returns /= close[:-delta_days]                       # percentage returns
        market_idx = assets.get_loc(market_sid)
        market_returns = returns[:,market_idx]
        betas = np.apply_along_axis(_beta, 0, returns, market_returns)
        returns -= (returns[:, [market_idx] ] * betas) # remove returns due to beta
        out[:] = returns[-1]
               
def get_daily_perc_ret(sid_universe, start_date, end_date, ret_type='normal', market=None):
    """
    Creates a DataFrame containing daily percentage returns: normal, market_excess or beta_excess returns
    
    Parameters
    ----------
    sid_universe : list
        List of sids for which the ruturns are computed.
    start_date : string
        Starting date for returns computation.
    end_date : string
        End date for returns computation.
    ret_type: string
        Type of returns: normal, market_excess or beta_excess
    market: equity
        The market, if None is passed 'SPY' will be used
    """
        
    if market is None:
        market = symbols('SPY')
    
    mask = create_sid_screen( sid_universe + [market.sid] )
    inputs = [USEquityPricing.open]
    
    daily_ret_columns = {
    'normal':        Returns(inputs=inputs, window_length=2, mask=mask),
    'market_excess': ReturnsMarketExcess(inputs=inputs, window_length=2, market_sid=market.sid, mask=mask),
    'beta_excess':   ReturnsBetaExcess(inputs=inputs, window_length=60, mask=mask, delta_days=2, market_sid=market.sid),
    }
    
    tmp_pipe = Pipeline( columns={'daily_perc_ret':daily_ret_columns[ret_type]} )
    tmp_pipe.set_screen(mask)
    
    perc_ret_pipe = run_pipeline(tmp_pipe, start_date=start_date, end_date=end_date)
    
    perc_ret_pipe = perc_ret_pipe.unstack().fillna(0)
    perc_ret_pipe = perc_ret_pipe['daily_perc_ret'] #this drop top level column
    
    # pipeline is run in the morning of each days before having open/close price for that day,
    # this means we need to shift the return to have that day return
    perc_ret_pipe = perc_ret_pipe.shift(-1)[:-1]
    
    return perc_ret_pipe
In [14]:
class Liquidity(CustomFactor):   
    inputs = [USEquityPricing.volume, USEquityPricing.close] 
    window_length = 5

    def compute(self, today, assets, out, volume, close): 
        out[:] = (volume * close).mean(axis=0)
        
class Sector(CustomFactor):
    inputs = [morningstar.asset_classification.morningstar_sector_code]
    window_length = 1
    def compute(self, today, assets, out, msc):
        out[:] = msc[-1]

def construct_factor_history(factor_cls, start_date='2015-10-1', end_date='2016-2-1', 
                             factor_name='factor',
                             top_liquid=1000, universe_constraints=None, sector_names=None):
    """
    Creates a DataFrame containing daily factor values and sector codes for a liquidity 
    constrained universe. The returned DataFrame is can be used in the factor tear sheet.

    Parameters
    ----------
    factor_cls : quantopian.pipeline.CustomFactor
        Factor class to be computed.
    start_date : string or pd.datetime
        Starting date for factor computation.
    end_date : string or pd.datetime
        End date for factor computation.
    factor_name : string, optional
        Column name for factor column in returned DataFrame.
    top_liquid : int, optional
        Limit universe to the top N most liquid names each trading day.
        Based on trailing 5 days traded dollar volume. 
    universe_constraints : num_expr, optional
        Pipeline universe constraint.

    Returns
    -------
    daily_factor : pd.DataFrame
        DataFrame with integer index and date, equity, factor, and sector
        code columns.

    """
    price = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=22)
    price_filter = (price >= 5.0)
    
    liquidity = Liquidity(mask=price_filter)
    liquidity_rank = liquidity.rank(ascending=False)
        
    ok_universe = (top_liquid > liquidity_rank)
    
    if universe_constraints is not None:
        ok_universe = ok_universe & universe_constraints
    
    factor = factor_cls(mask=ok_universe)
    sector = Sector(mask=ok_universe)
    
    ok_universe = ok_universe & factor.eq(factor) & sector.eq(sector)
    
    pipe = Pipeline()
    pipe.add(factor, factor_name)
    pipe.add(sector, 'sector_code')
    pipe.set_screen(ok_universe)
    
    daily_factor = run_pipeline(pipe, start_date=start_date, end_date=end_date)
    daily_factor = daily_factor.reset_index().rename(
        columns={'level_0': 'date', 'level_1':'equity'})
    
    daily_factor = daily_factor[daily_factor.sector_code != -1]
    if sector_names is not None:
        daily_factor.sector_code = daily_factor.sector_code.apply(
            lambda x: sector_names[x])
    
    return daily_factor


def get_daily_returns(daily_factor, start_date, end_date, extra_days_before=0, extra_days_after=0, ret_type='normal'):
    """
    Creates a DataFrame containing daily percentage returns: normal, market_excess or beta_excess returns
    
    Parameters
    ----------
    daily_factor : pd.DataFrame
        DataFrame with, at minimum, date, equity, factor, columns. Index can be integer or
        date/equity multiIndex.
        See construct_factor_history for more detail. 
    start_date : string
        Starting date for returns computation.
    end_date : string
        End date for returns computation.
    ret_type: string
        Type of returns: normal, market_excess or beta_excess
        
    Returns
    -------
    price : pd.DataFrame
        DataFrame with date index and equity columns with and percentage price movement that can be
        normal, market excess or beta excess
    """   
    extra_days = math.ceil(extra_days_before * 365.0/252.0) + 5 # just to be sure
    start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d") - datetime.timedelta(days=extra_days)
    start_date = start_date.strftime("%Y-%m-%d")
    
    extra_days = math.ceil(extra_days_after * 365.0/252.0) + 5 # just to be sure
    end_date = datetime.datetime.strptime(end_date, "%Y-%m-%d") + datetime.timedelta(days=extra_days)
    end_date = end_date.strftime("%Y-%m-%d")
    
    sid_universe = daily_factor['equity'].unique()
    sid_universe = map(lambda x:x.sid, sid_universe)
    
    daily_perc_ret = get_daily_perc_ret(sid_universe=sid_universe, start_date=start_date, end_date=end_date,
                                        ret_type=ret_type)
    
    return daily_perc_ret


def add_forward_price_movement(daily_factor, days=[1, 5, 10], prices=None):
    """
    Adds N day forward price movements (as percent change) to a factor value
    DataFrame. 

    Parameters
    ----------
    daily_factor : pd.DataFrame
        DataFrame with, at minimum, date, equity, factor, columns. Index can be integer or
        date/equity multiIndex.
        See construct_factor_history for more detail. 
    days : list
        Number of days forward to project price movement. One column will be added for each value. 
    prices : pd.DataFrame, optional
        Pricing data to use in forward price calculation. Equities as columns, dates as index.
        If no value is passed, get pricing will be called.

    Returns
    -------
    factor_and_fp : pd.DataFrame
        DataFrame with integer index and date, equity, factor, sector
        code columns with and an arbitary number of N day forward percentage
        price movement columns.

    """
    factor_and_fp = daily_factor.copy()
    if not isinstance(factor_and_fp.index, pd.core.index.MultiIndex):
        factor_and_fp = factor_and_fp.set_index(['date', 'equity'])
    
    if prices is None:
        start_date = factor_and_fp.index.levels[0].values.min()
        end_date = factor_and_fp.index.levels[0].values.max()
        
        equities = factor_and_fp.index.levels[1].unique()

        time_buffer = pd.Timedelta(days=max(days)+5)
        prices = get_pricing(equities, 
                             start_date=start_date, 
                             end_date=end_date+time_buffer, 
                             fields='open_price')
        
    col_n = '%s_day_fwd_price_change'
    for i in days:
        delta = prices.pct_change(i).shift(-i)
        factor_and_fp[col_n%i] = delta.stack()

    factor_and_fp = factor_and_fp.reset_index()
    
    return factor_and_fp


def sector_adjust_forward_price_moves(factor_and_fp):
    """
    Convert forward price movements to price movements relative to mean sector price movements.
    This normalization incorperates the assumption of a sector neutral portfolio constraint
    and thus allows allows the factor to be evaluated across sectors. 
    
    For example, if AAPL 5 day return is 0.1% and mean 5 day return for the Technology stocks 
    in our universe was 0.5% in the same period, the sector adjusted 5 day return for AAPL
    in this period is -0.4%.
    
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns. Index should be integer.
        See add_forward_price_movement for more detail. 
        
    Returns
    -------
    adj_factor_and_fp : pd.DataFrame
        DataFrame with integer index and date, equity, factor, sector
        code columns with and an arbitary number of N day forward percentage
        price movement columns, each normalized by sector.
    
    """
    adj_factor_and_fp = factor_and_fp.copy()
    pc_cols = [col for col in factor_and_fp.columns.values if 'fwd_price_change' in col]
    
    adj_factor_and_fp[pc_cols] = factor_and_fp.groupby(['date', 'sector_code'])[pc_cols].apply(
             lambda x: x - x.mean())
    
    return adj_factor_and_fp


def factor_spearman_rank_IC(factor_and_fp, time_rule=None, by_sector=True, factor_name='factor'):
    """
    Computes sector neutral Spearman Rank Correlation based Information Coefficient between 
    factor values and N day forward price movements. 

    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns. Index should be integer.
        See add_forward_price_movement for more detail. 
    time_rule : string, optional
        Time span to use in Pandas DateTimeIndex grouping reduction.
        See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for available options.
    by_sector : boolean
        If True, compute ic separately for each sector 
    factor_name : string
        Name of factor column on which to compute IC.

    Returns
    -------
    ic : pd.DataFrame
        Spearman Rank correlation between factor and provided forward price movement columns. 
        MultiIndex of date, sector.
    err : pd.DataFrame
        Standard error of computed IC. MultiIndex of date, sector.
        MultiIndex of date, sector.
    
    """
    def src_ic(x):
        cn = "%s_day_IC"
        ic = pd.Series()
        for days, col in zip(fwd_days, pc_cols):
            ic[cn%days] = sp.stats.spearmanr(x[factor_name], x[col])[0]
        ic['obs_count'] = len(x)
            
        return ic
    
    def src_std_error(rho, n):
        return np.sqrt((1-rho**2)/(n-2))
    
    
    fwd_days, pc_cols = get_price_move_cols(factor_and_fp)
    
    grpr = ['date', 'sector_code'] if by_sector else ['date']
    ic = factor_and_fp.groupby(grpr).apply(src_ic)
    
    obs_count = ic.pop('obs_count')
    err = ic.apply(lambda x: src_std_error(x, obs_count))

    if time_rule is not None:
        ic = ic.reset_index().set_index('date')
        err = err.reset_index().set_index('date')
        
        grpr = [pd.TimeGrouper(time_rule),'sector_code'] if by_sector else [pd.TimeGrouper(time_rule)]
        ic = ic.groupby(grpr).mean()
        err = err.groupby(grpr).agg(
            lambda x: np.sqrt((np.sum(np.power(x, 2))/len(x))))
        
    else:
        if by_sector:
            ic = ic.reset_index().groupby(['sector_code']).mean()
            err = err.reset_index().groupby(['sector_code']).agg(
                lambda x: np.sqrt((np.sum(np.power(x, 2))/len(x))))    

    return ic, err


def quantile_bucket_factor(factor_and_fp, by_sector=True, quantiles=5, factor_name='factor'):
    """
    Computes daily factor quantiles.
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns. 
        Index should be integer. See add_forward_price_movement for more detail. 
    by_sector : boolean
        If True, compute quantile buckets separately for each sector.
    quantiles : integer
        Number of quantiles buckets to use in factor bucketing.
    factor_name : string
        Name of factor column on which to compute quantiles.

    Returns
    -------
    factor_and_fp_ : pd.DataFrame
        Factor and forward price movements with additional factor quantile column.
    """

    g_by = ['date', 'sector_code'] if by_sector else ['date']
    factor_and_fp_ = factor_and_fp.copy()

    factor_and_fp_['factor_percentile'] = factor_and_fp_.groupby(
                g_by)[factor_name].rank(pct=True)
    
    q_int_width = 1. / quantiles
    factor_and_fp_['factor_bucket'] = factor_and_fp_.factor_percentile.apply(
        lambda x: ((x - .000000001) // q_int_width) + 1)
    
    
    return factor_and_fp_
  
    
def quantile_bucket_mean_daily_return(quantile_factor, by_sector=False):
    """
    Computes mean daily returns for factor quantiles across provided forward 
    price movement columns.
    
    Parameters
    ----------
    quantile_factor : pd.DataFrame
        DataFrame with date, equity, factor, factor quantile, and forward price movement columns. 
        Index should be integer. See quantile_bucket_factor for more detail. 
    by_sector : boolean
        If True, compute quintile bucket returns separately for each sector 
    quantiles : integer
        Number of quantiles buckets to use in factor bucketing.


    Returns
    -------
    mean_returns_by_quantile : pd.DataFrame
        Sector-wise mean daily returns by specified factor quantile.
    """
    fwd_days, pc_cols = get_price_move_cols(quantile_factor)
    
    def daily_mean_ret(x):
        mean_ret = pd.Series()
        for days, col in zip(fwd_days, pc_cols):
            mean_ret[col] = x[col].mean() #/ days
            
        return mean_ret
    
    g_by = ['sector_code', 'factor_bucket'] if by_sector else ['factor_bucket']
    mean_ret_by_quantile = quantile_factor.groupby(
            g_by)[pc_cols].apply(daily_mean_ret)
    
    return mean_ret_by_quantile
    
    
def quantile_turnover(quantile_factor, quantile):
    """
    Computes the proportion of names in a factor quantile that were 
    not in that quantile in the previous period.
    
    Parameters
    ----------
    quantile_factor : pd.DataFrame
        DataFrame with date, equity, factor, factor quantile, and forward price movement columns. 
        Index should be integer. See quantile_bucket_factor for more detail. 
    quantile : integer
        Quantile on which to perform turnover analysis.
        
    Returns
    -------
    quant_turnover : pd.Series
        Period by period turnover for that quantile.
    """
    
    quant_names = quantile_factor[quantile_factor.factor_bucket == quantile]
    
    quant_name_sets = quant_names.groupby(['date']).equity.apply(set)
    new_names = (quant_name_sets - quant_name_sets.shift(1)).dropna()
    quant_turnover = new_names.apply(lambda x: len(x)) / quant_name_sets.apply(lambda x: len(x))
    
    return quant_turnover


def factor_rank_autocorrelation(daily_factor, time_rule='W', factor_name='factor'):
    """
    Computes autocorrelation of mean factor ranks in specified timespans.
    We must compare week to week factor ranks rather than factor values to account for
    systematic shifts in the factor values of all names or names within a sector.
    This metric is useful for measuring the turnover of a factor. If the value of a factor
    for each name changes randomly from week to week, we'd expect a weekly autocorrelation of 0.
    
    Parameters
    ----------
    daily_factor : pd.DataFrame
        DataFrame with integer index and date, equity, factor, and sector
        code columns.
    time_rule : string, optional
        Time span to use in factor grouping mean reduction.
        See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for available options.
    factor_name : string
        Name of factor column on which to compute IC.

    Returns
    -------
    autocorr : pd.Series
        Rolling 1 period (defined by time_rule) autocorrelation of factor values.
    
    """
    daily_ranks = daily_factor.copy()
    daily_ranks[factor_name] = daily_factor.groupby(['date', 'sector_code'])[factor_name].apply(
        lambda x: x.rank(ascending=True))

    equity_factor = daily_ranks.pivot(index='date', columns='equity', values=factor_name)
    if time_rule is not None:
        equity_factor = equity_factor.resample(time_rule, how='mean')

    autocorr = equity_factor.corrwith(equity_factor.shift(1), axis=1)
    
    return autocorr
    
    
def get_price_move_cols(x):
    pc_cols = [col for col in x.columns.values if 'fwd_price_change' in col]
    fwd_days = map(lambda x: int(x.split('_')[0]), pc_cols)
    
    return fwd_days, pc_cols

def get_ic_cols(x):
    return [col for col in x.columns.values if 'day_IC' in col]


def is_outlier(points, thresh=3.0):
    """
    Utility function to remove outliers, for a better graph visualization
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh


def build_cumulative_returns_series(factor_and_fp, daily_perc_ret, days_before, days_after, day_zero_align=False):
    """
    An equity and date pair is extracted from each row in the input dataframe and for each of 
    these pairs a cumulative return time series is built starting 'days_before' days
    before and ending 'days_after' days after the date specified in the pair
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with at least date and equity columns.
    daily_perc_ret : pd.DataFrame
        Pricing data to use in cumulative return calculation. Equities as columns, dates as index.
    day_zero_align : boolean
         Aling returns at day 0 (timeseries is 0 at day 0)
    """
    
    ret_df = pd.DataFrame()

    for index, row in factor_and_fp.iterrows():
        timestamp, equity = row['date'], row['equity']
        timestamp_idx = daily_perc_ret.index.get_loc(timestamp)
        start = timestamp_idx - days_before
        end   = timestamp_idx + days_after
        series = daily_perc_ret.ix[start:end, equity]
        ret_df = pd.concat( [ret_df, series], axis=1, ignore_index=True)

    # Reset index to have the same starting point (from datetime to day offset)
    ret_df = ret_df.apply(lambda x : x.dropna().reset_index(drop=True), axis=0)
    ret_df.index = range(-days_before, days_after)
    
    # From daily percent returns to comulative returns
    ret_df  = (ret_df  + 1).cumprod() - 1
    
    # Make returns be 0 at day 0
    if day_zero_align:
        ret_df -= ret_df.iloc[days_before,:]
    
    return ret_df
In [15]:
def plot_daily_ic(factor_and_fp, factor_name='factor'):
    """
    Plots Spearman Rank Information Coefficient and IC moving average for a given factor. 
    Sector neturalization of forward price movements with sector_adjust_forward_price_moves is
    recommended.
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns. 
    factor_name : string
        Name of factor column on which to compute IC.
    """ 
    
    daily_ic, _ = factor_spearman_rank_IC(factor_and_fp, by_sector=False, 
                                          factor_name=factor_name)
    ic_cols = get_ic_cols(daily_ic)
    for col in ic_cols:
        mean_ic = daily_ic[col].mean()
        std_ic = daily_ic[col].std()
        fp_ic = pd.DataFrame(daily_ic[col])
        fp_ic['1 month moving avg'] = pd.rolling_mean(fp_ic[col], 22)
        fp_ic.plot(title= "{} {} (sector adjusted)".format(factor_name, col), figsize=(20,10))
        print('{} mean: {}'.format(col, mean_ic))
        print('{} stdev: {}'.format(col, std_ic))
        print('{} mean/stdev: {}'.format(col, mean_ic/std_ic))
        plt.ylabel('IC')
        plt.xlabel('date')
        plt.show()
        sns.distplot(daily_ic[col].replace(np.nan, 0) ,norm_hist=True)
        plt.show()
        
        

def plot_ic_by_sector(factor_and_fp, factor_name='factor'):
    """
    Plots Spearman Rank Information Coefficient for a given factor over provided forward price
    movement windows. Separates by sector.
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns. 
    factor_name : string
        Name of factor column on which to compute IC.
    """
    ic_sector, err_sector = factor_spearman_rank_IC(factor_and_fp, factor_name=factor_name)

    ic_sector.plot(kind='bar' ) #yerr=err_sector
    fig = plt.gcf()
    fig.suptitle("Information Coefficient by Sector", fontsize=16, x=.5, y=.93)
    plt.show()
    
    
def plot_ic_by_sector_over_time(factor_and_fp, time_rule=None, factor_name='factor'):
    """
    Plots sector-wise time window mean daily Spearman Rank Information Coefficient 
    for a given factor over provided forward price movement windows.
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns. 
    time_rule : string, optional
        Time span to use in time grouping reduction.
        See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for available options.
    factor_name : string
        Name of factor column on which to compute IC.
    """
    ic_time, err_time = factor_spearman_rank_IC(factor_and_fp, time_rule=time_rule,
                                                factor_name=factor_name)
    ic_time = ic_time.reset_index()
    err_time = err_time.reset_index()

    f, axes = plt.subplots(6,2, sharex=False, sharey=True, figsize=(20,45))
    axes = axes.flatten()
    i = 0
    for sc, data in ic_time.groupby(['sector_code']):
        e = err_time[err_time.sector_code == sc].set_index('date')
        data.drop('sector_code', axis=1).set_index('date').plot(kind='bar', 
                                                                title=sc, 
                                                                ax=axes[i],
                                                                ) # yerr=e
        i+=1
    fig = plt.gcf()
    fig.suptitle("Monthly Information Coefficient by Sector", fontsize=16, x=.5, y=.93)
    plt.show()
    
def plot_quantile_returns(factor_and_fp, by_sector=True, quantiles=5, factor_name='factor'):
    """
    Plots sector-wise mean daily returns for factor quantiles 
    across provided forward price movement columns.
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns.
    by_sector : boolean
        Disagregate figures by sector.
    quantiles : integer
        Number of quantiles buckets to use in factor bucketing.
    factor_name : string
        Name of factor column on which to compute IC.
    """
    decile_factor = quantile_bucket_factor(factor_and_fp, by_sector=by_sector, quantiles=quantiles, 
                                           factor_name=factor_name)
    mean_ret_by_q = quantile_bucket_mean_daily_return(decile_factor, by_sector=by_sector)
    
    if by_sector:
        f, axes = plt.subplots(6,2, sharex=False, sharey=True, figsize=(20,45))
        axes = axes.flatten()
        i = 0
        for sc, cor in mean_ret_by_q.groupby(level='sector_code'):
            cor = cor.reset_index().drop('sector_code', axis=1).set_index('factor_bucket')
            cor.plot(kind='bar', title=sc, ax=axes[i])
            axes[i].set_xlabel('factor quantile')
            axes[i].set_ylabel('mean price % change')
            i+=1
        fig = plt.gcf()
        fig.suptitle(factor_name + ": Mean Return By Factor Quantile", fontsize=24, x=.5, y=.93)
            
    else:
        mean_ret_by_q.plot(kind='bar', 
                           title="Mean Return By Factor Quantile (sector adjusted)")
        plt.xlabel('factor quantile')
        plt.ylabel('mean daily price % change')
        
    plt.show()

def plot_quantile_returns_box(factor_and_fp, by_sector=True, quantiles=5, factor_name='factor'):
    """
    Plots sector-wise mean daily returns as boxplot for factor quantiles 
    across provided forward price movement columns.
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns.
    by_sector : boolean
        Disagregate figures by sector.
    quantiles : integer
        Number of quantiles buckets to use in factor bucketing.
    factor_name : string
        Name of factor column on which to compute IC.
    """
    decile_factor = quantile_bucket_factor(factor_and_fp, by_sector=by_sector, quantiles=quantiles, 
                                           factor_name=factor_name)
    fwd_days, pc_cols = get_price_move_cols(decile_factor)
    
    if by_sector:
        f, axes = plt.subplots(6,2, sharex=False, sharey=True, figsize=(20,45))
        axes = axes.flatten()
        i = 0
        for sc, cor in decile_factor.groupby(by='sector_code'):
            cor_box_plot = pd.melt(cor, var_name='fwd_days_price', value_name='%_price_change',
                                             id_vars=['factor_bucket'], value_vars=pc_cols)
            # boxplot doesn't sort 'x' by itself
            cor_box_plot = cor_box_plot.sort(columns='factor_bucket', ascending=True)
            sns.boxplot(ax=axes[i], x="factor_bucket", y="%_price_change", hue="fwd_days_price",
                        hue_order=pc_cols, data=cor_box_plot)
            
            axes[i].set_xlabel('factor quantile')
            axes[i].set_ylabel('mean price % change')
            axes[i].set_title(sc)
            i+=1
        fig = plt.gcf()
        fig.suptitle(factor_name + ": Mean Return By Factor Quantile", fontsize=24, x=.5, y=.93)
            
    else:

        decile_factor_box_plot = pd.melt(decile_factor, var_name='fwd_days_price', value_name='%_price_change',
                                 id_vars=['factor_bucket'], value_vars=pc_cols)
        # boxplot doesn't sort 'x' by itself
        decile_factor_box_plot = decile_factor_box_plot.sort(columns='factor_bucket', ascending=True)
        sns.boxplot(x="factor_bucket", y="%_price_change", hue="fwd_days_price", 
                    hue_order=pc_cols, data=decile_factor_box_plot)
        
        plt.title("Mean Return By Factor Quantile (sector adjusted)")
        plt.xlabel('factor quantile')
        plt.ylabel('mean price % change')
        
    plt.show()


def plot_factor_rank_auto_correlation(daily_factor, time_rule='W', factor_name='factor'):
    """
    Plots factor rank autocorrelation over time. See factor_rank_autocorrelation for more details.
    
    Parameters
    ----------
    daily_factor : pd.DataFrame
        DataFrame with date, equity, and factor value columns. 
    time_rule : string, optional
        Time span to use in time grouping reduction prior to autocorrelation calculation.
        See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for available options.
    factor_name : string
        Name of factor column on which to compute IC.
    """ 
    
    fa = factor_rank_autocorrelation(daily_factor, time_rule=time_rule, factor_name=factor_name)
    print "Mean rank autocorrelation: " + str(fa.mean())
    fa.plot(title='Week-to-Week Factor Rank Autocorrelation')
    plt.ylabel('autocorrelation coefficient')
    plt.show()
    
def plot_top_bottom_quantile_turnover(daily_factor, num_quantiles=5, factor_name='factor'):
    """
    Plots daily top and bottom quantile factor turnover. See quantile_bucket_factor for more
    details.
    
    Parameters
    ----------
    daily_factor : pd.DataFrame
        DataFrame with date, equity, and factor value columns. 
    num_quantiles : integer
        Number of quantiles to use in quantile bucketing.
    factor_name : string
        Name of factor column on which to compute IC.
    """ 
    
    quint_buckets = quantile_bucket_factor(daily_factor, by_sector=True, 
                                           quantiles=5, factor_name=factor_name)
    turnover = pd.DataFrame()
    turnover['top quintile turnover'] = quantile_turnover(quint_buckets, num_quantiles)
    turnover['bottom quintile turnover'] = quantile_turnover(quint_buckets, 1)
    
    turnover.plot(title='Top and Bottom Quintile Turnover (Quantiles Computed by Sector)')
    plt.ylabel('proportion of names not present in quantile in previous period')
    plt.show()


def plot_factor_vs_fwdprice_distribution(factor_and_fp, factor_name='factor', remove_outliers = False):
    """
    Plots distribuion of factor vs forward price.
    This is useful to visually spot linear or non linear relationship between factor and fwd prices
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns.
    factor_name : string
        Name of factor column
    remove_outliers : boolean
        Remove outliers before plotting the distribution
    """
    fwd_days, pc_cols = get_price_move_cols(factor_and_fp)
    for col in pc_cols:
        
        if remove_outliers:
            data = factor_and_fp[ ~is_outlier(factor_and_fp[factor_name].values) & \
                                  ~is_outlier(factor_and_fp[col].values)]
        else:
            data = factor_and_fp
               
        jg = sns.jointplot(data[factor_name], data[col], kind="kde")
        jg.fig.suptitle('Factor/returns kernel density estimation' +
                        (' NO OUTLIERS' if remove_outliers else '') )
        plt.show()
  
        jg = sns.jointplot(data[factor_name], data[col], kind="reg")
        jg.fig.suptitle('Factor/returns regression' +
                        (' NO OUTLIERS' if remove_outliers else '') )
        
        plt.show()



def plot_quantile_cumulative_return(factor_and_fp, daily_perc_ret, quantiles=5, by_quantile = False,
                                    factor_name='factor', days_before = 15, days_after = 15,
                                    std_bar = True, day_zero_align = True):
    """
    Plots sector-wise mean daily returns for factor quantiles 
    across provided forward price movement columns.
    
    Parameters
    ----------
    factor_and_fp : pd.DataFrame
        DataFrame with date, equity, factor, and forward price movement columns.
    daily_perc_ret : pd.DataFrame
        Pricing data to use in cumulative return calculation. Equities as columns, dates as index.
    quantiles : integer
        Number of quantiles buckets to use in factor bucketing.
    by_quantile : boolean
        Disagregate figures by quantile.
    factor_name : string
        Name of factor column on which to compute IC.
    days_before : int
        How many days to plot before the factor is calculated
    days_after  : int
        How many days to plot after the factor is calculated
    day_zero_align : boolean
         Aling returns at day 0 (timeseries is 0 at day 0)
    std_bar : boolean
        Plot standard deviation plot
    """
    decile_factor = quantile_bucket_factor(factor_and_fp, by_sector=False, quantiles=quantiles, 
                                           factor_name=factor_name)
            
    cumulative_returns = {}

    for fb, cor in decile_factor.groupby(by='factor_bucket'):
        cumulative_returns[fb] = build_cumulative_returns_series(cor, daily_perc_ret, days_before, days_after, day_zero_align)
    
    palette = sns.color_palette("coolwarm", len(cumulative_returns))
    
    if by_quantile:
        nrows=int(round(len(cumulative_returns)/2.0))
        ncols=2
        fig, axes = plt.subplots(nrows, ncols, figsize=(20,45))
        axes = axes.flatten()
        i = 0
        for fb, ret_df in cumulative_returns.iteritems():
            # plot cumulative returs
            label = 'Quantile ' + str(fb)
            sns.tsplot(ax=axes[i], data=ret_df.T.values, condition=label, legend=True, color=palette[i], time=ret_df.index)
                       # , err_style="unit_traces") for single traces
            axes[i].set_ylabel('% return')
            # plot std dev bars
            if std_bar:
                mean = ret_df.mean(axis=1)
                std  = ret_df.std(axis=1)
                axes[i].errorbar(ret_df.index, mean, yerr=std, fmt='-o', color=palette[i])
            # mark day zero with a vertical line
            axes[i].axvline(x=0, color='k', linestyle='--')
            i+=1
        fig = plt.gcf()
        fig.suptitle("Cumulative returns by quantile", fontsize=24, x=.5, y=.93)
            
    else:
        # plot cumulative returs
        ax = None
        i = 0
        for fb, ret_df in cumulative_returns.iteritems():
            label = 'Quantile ' + str(fb)
            ax = sns.tsplot(ax=ax, data=ret_df.T.values, condition=label, legend=True, color=palette[i], time=ret_df.index)
                            # , err_style="unit_traces") for single traces
            # plot std dev bars
            if std_bar:
                mean = ret_df.mean(axis=1)
                std  = ret_df.std(axis=1)
                ax.errorbar(ret_df.index, mean, yerr=std, fmt='-o', color=palette[i])
            i+=1
        # mark day zero with a vertical line
        ax.axvline(x=0, color='k', linestyle='--')
        plt.xlabel('Days')
        plt.ylabel('% return')
        plt.title("Cumulative returns by quantile")
        
    plt.show()
In [16]:
def create_factor_tear_sheet(factor_cls, 
                             factor_name='factor',
                             start_date='2015-10-1', 
                             end_date='2016-2-1',
                             top_liquid=1000,
                             sector_names=None,
                             days=[1, 5, 10],
                             nquantiles = 10,
                             ret_type='normal', # normal, market_excess or beta_excess
                             days_before=36,
                             days_after=20,
                            ):

    factor = construct_factor_history(factor_cls, start_date=start_date, end_date=end_date, 
                                      factor_name=factor_name, top_liquid=top_liquid, 
                                      sector_names=sector_names)
    daily_perc_ret = get_daily_returns(factor, start_date, end_date, extra_days_before=days_before,
                                       extra_days_after=days_after, ret_type=ret_type)
    factor_and_fp = add_forward_price_movement(factor, days=days, prices=(daily_perc_ret+1.0).cumprod())

    pattern_dict = {
      -2 : 'HS',   
      2  : 'IHS', 
      -1 : 'BTOP', 
      1  : 'BBOT', 
      -4 : 'TTOP', 
      4  : 'TBOT', 
      -3 : 'RTOP', 
      3  : 'RBOT', 
    }


    for pattern, df in factor_and_fp.groupby(by='PatternFactor'):
        print "Pattern ", pattern_dict[pattern], " entries ", len(df.index)
        
        # Plot comulative returns over time for each quantile
        plot_quantile_cumulative_return(df, daily_perc_ret, quantiles=nquantiles, by_quantile=False, 
                        factor_name=factor_name, days_before=days_before, days_after=days_after, std_bar=False)
        plot_quantile_cumulative_return(df, daily_perc_ret, quantiles=nquantiles, by_quantile=False, 
                        factor_name=factor_name, days_before=2, days_after=max(days)+1, std_bar=True)
        # What are the sector-neutral factor decile mean returns for our different forward price windows? 
        plot_quantile_returns(df, by_sector=False, quantiles=nquantiles, factor_name=factor_name)
        # As above but more detailed, we want to know the volatility of returns
        plot_quantile_returns_box(df, by_sector=False, quantiles=nquantiles, factor_name=factor_name)

Factor Definition

Start by defining a pipeline factor.

In [17]:
from __future__ import division
from statsmodels.nonparametric.kernel_regression import KernelReg
from numpy import linspace
from scipy.signal import argrelextrema
from collections import defaultdict


def find_max_min(prices):
    prices_ = prices.copy()
    prices_.index = linspace(1., len(prices_), len(prices_))
    kr = KernelReg([prices_.values], [prices_.index.values], var_type='c', bw=[1.8,1])
    f = kr.fit([prices_.index.values])
    smooth_prices = pd.Series(data=f[0], index=prices.index)
    
    local_max = argrelextrema(smooth_prices.values, np.greater)[0]
    local_min = argrelextrema(smooth_prices.values, np.less)[0]
    
    price_local_max_dt = []
    for i in local_max:
        if (i>1) and (i<len(prices)-1):
            price_local_max_dt.append(prices.iloc[i-2:i+2].argmax())

    price_local_min_dt = []
    for i in local_min:
        if (i>1) and (i<len(prices)-1):
            price_local_min_dt.append(prices.iloc[i-2:i+2].argmin())
        
    prices.name = 'price'
    maxima = pd.DataFrame(prices.loc[price_local_max_dt])
    minima = pd.DataFrame(prices.loc[price_local_min_dt])
    max_min = pd.concat([maxima, minima]).sort_index()
    max_min.index.name = 'date'
    max_min = max_min.reset_index()
    max_min = max_min[~max_min.date.duplicated()]
    p = prices.reset_index()
    max_min['day_num'] = p[p['index'].isin(max_min.date)].index.values
    max_min = max_min.set_index('day_num').price
    
    return max_min


def find_patterns(max_min):
    patterns = defaultdict(list)

    for i in range(5, len(max_min)+1):
        window = max_min.iloc[i-5:i]

        # pattern must play out in less than 36 days
        if window.index[-1] - window.index[0] > 35:
            continue

        # Using the notation from the paper to avoid mistakes
        e1 = window.iloc[0]
        e2 = window.iloc[1]
        e3 = window.iloc[2]
        e4 = window.iloc[3]
        e5 = window.iloc[4]

        rtop_g1 = np.mean([e1,e3,e5])
        rtop_g2 = np.mean([e2,e4])
        # Head and Shoulders
        if (e1 > e2) and (e3 > e1) and (e3 > e5) and \
            (abs(e1 - e5) <= 0.03*np.mean([e1,e5])) and \
            (abs(e2 - e4) <= 0.03*np.mean([e1,e5])):
                patterns['HS'].append((window.index[0], window.index[-1]))

        # Inverse Head and Shoulders
        elif (e1 < e2) and (e3 < e1) and (e3 < e5) and \
            (abs(e1 - e5) <= 0.03*np.mean([e1,e5])) and \
            (abs(e2 - e4) <= 0.03*np.mean([e1,e5])):
                patterns['IHS'].append((window.index[0], window.index[-1]))

        # Broadening Top
        elif (e1 > e2) and (e1 < e3) and (e3 < e5) and (e2 > e4):
            patterns['BTOP'].append((window.index[0], window.index[-1]))

        # Broadening Bottom
        elif (e1 < e2) and (e1 > e3) and (e3 > e5) and (e2 < e4):
            patterns['BBOT'].append((window.index[0], window.index[-1]))

        # Triangle Top
        elif (e1 > e2) and (e1 > e3) and (e3 > e5) and (e2 < e4):
            patterns['TTOP'].append((window.index[0], window.index[-1]))

        # Triangle Bottom
        elif (e1 < e2) and (e1 < e3) and (e3 < e5) and (e2 > e4):
            patterns['TBOT'].append((window.index[0], window.index[-1]))

        # Rectangle Top
        elif (e1 > e2) and (abs(e1-rtop_g1)/rtop_g1 < 0.0075) and \
            (abs(e3-rtop_g1)/rtop_g1 < 0.0075) and (abs(e5-rtop_g1)/rtop_g1 < 0.0075) and \
            (abs(e2-rtop_g2)/rtop_g2 < 0.0075) and (abs(e4-rtop_g2)/rtop_g2 < 0.0075) and \
            (min(e1, e3, e5) > max(e2, e4)):

            patterns['RTOP'].append((window.index[0], window.index[-1]))

        # Rectangle Bottom
        elif (e1 < e2) and (abs(e1-rtop_g1)/rtop_g1 < 0.0075) and \
            (abs(e3-rtop_g1)/rtop_g1 < 0.0075) and (abs(e5-rtop_g1)/rtop_g1 < 0.0075) and \
            (abs(e2-rtop_g2)/rtop_g2 < 0.0075) and (abs(e4-rtop_g2)/rtop_g2 < 0.0075) and \
            (max(e1, e3, e5) > min(e2, e4)):
            patterns['RBOT'].append((window.index[0], window.index[-1]))
            
    return patterns


def _pattern_identification(prices, indentification_lag):
    max_min = find_max_min(prices)
    
    # we are only interested in the last pattern (if multiple patterns are there)
    # and also the last min/max must have happened less than "indentification_lag"
    # days ago otherways it mush have already been identified or it is too late to be usefull
    max_min_last_window = None
    
    for i in reversed(range(len(max_min))):
        if (prices.index[-1] - max_min.index[i]) == indentification_lag:
            max_min_last_window = max_min.iloc[i-4:i+1]
            break

    if max_min_last_window is None:
        return np.nan
       
    # possibly identify a pattern in the selected window
    patterns = find_patterns(max_min_last_window)
    if len(patterns) != 1:
        return np.nan
    
    name, start_end_day_nums = patterns.iteritems().next()
    
    pattern_code = {
           'HS'   : -2,
           'IHS'  : 2,
           'BTOP' : -1,
           'BBOT' : 1,
           'TTOP' : -4,
           'TBOT' : 4,
           'RTOP' : -3,
           'RBOT' : 3,
    }
    
    return pattern_code[name]


class PatternFactor(CustomFactor):
    
    params = ('indentification_lag',)        
    inputs = [USEquityPricing.close]
    window_length = 40
    
    def compute(self, today, assets, out, close, indentification_lag):
        prices = pd.DataFrame(close, columns=assets)
        out[:] = prices.apply(_pattern_identification, args=(indentification_lag,))   

Tearsheet

In [18]:
factor_name='PatternFactor'
start_date='2013-1-1'
end_date='2016-5-1'
top_liquid=500
sector_names=None
days=[1, 2, 3, 4, 5, 6, 10]
nquantiles = 1
ret_type='beta_excess' # normal, market_excess or beta_excess
days_before=36
days_after=max(days)+10
In [8]:
def factor_cls(mask):
    return PatternFactor(mask=mask, window_length = 56, indentification_lag=10)
    
create_factor_tear_sheet(factor_cls, factor_name, start_date, end_date, top_liquid, sector_names,
                         days, nquantiles, ret_type, days_before, days_after)
Pattern  TTOP  entries  668
/usr/local/lib/python2.7/dist-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
Pattern  RTOP  entries  389
Pattern  HS  entries  1363
Pattern  BTOP  entries  771
Pattern  BBOT  entries  565