Notebook

# Exploring Machine Learning on Quantopian¶

Recently, Quantopian’s Chief Investment Officer, Jonathan Larkin, shared an industry insider’s overview of the professional quant equity workflow. This workflow is comprised of distinct stages including: (1) Universe Definition, (2) Alpha Discovery, (3) Alpha Combination, (4) Portfolio Construction and (5) Trading.

This Notebook focuses on stage 3: Alpha Combination. At this stage, Machine Learning is an intuitive choice as we have abstracted the problem to such a degree that it is now a classic classification (or regression) problem which ML is very good at solving and coming up with an alpha combination that is predictive.

As you will see, there is a lot of code here setting up a factor library and some data wrangling to get everything into shape. The details of this part are perhaps not quite as interesting so feel free to skip directly to "Training our ML pipeline" where we have everything in place to train and test our classifier.

## Overview¶

1. Define trading universe to use (Q500US and Q1500US).
2. Define alphas (implemented in Pipeline).
3. Run pipeline.
4. Split into train and test set.
5. Preprocess data (rank alphas, subsample, align alphas with future returns, impute, scale).
6. Train Machine Learning classifier (AdaBoost from Scikit-Learn).
7. Evaluate Machine Learning classifier on test set.

Note that one important limitation is that we only train and test on static (i.e. fixed-in-time) data. Thus, you can not directly do the same in an algorithm. In principle, this is possible and will be the next step but it makes sense to first focus on just the ML in a more direct way to get a good intuition about the workflow and how to develop a competitive ML pipeline.

### Disclaimer¶

This workflow is still a bit rough around the edges. We are working on improving it and adding better educational materials. This serves as a sneak-peek for the curious and adventurous.

In [1]:
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Latest
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.psychsignal import stocktwits_free
from quantopian.pipeline.data.logitbot import precog_top_100
from quantopian.pipeline.factors import CustomFactor, SimpleMovingAverage, AverageDollarVolume, Returns, RSI, Latest
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.filters import Q500US, Q1500US
from quantopian.pipeline.data.quandl import fred_usdontd156n as libor
from quantopian.pipeline.data.zacks import EarningsSurprises

import talib
import pandas as pd
import numpy as np
from time import time

import alphalens as al
import pyfolio as pf
from scipy import stats
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, ensemble, preprocessing, isotonic, metrics


## Definition of some commonly used factors¶

The factors below are a small collection of commonly used alphas that were coded by Gil Wassermann. I will post a separate Notebook with the full collection and more descriptions of them. Ultimately we will put these into a library you can just import to avoid the wall of text. If you want to understand more about pipeline, read the tutorial.

Also note the Earnings_Quality alpha which uses Zacks Earnings Surprises, a new source from our partners.

The details of these factors are not the focus of this Notebook so feel free to just skip this cell.

In [2]:
bs = morningstar.balance_sheet
cfs = morningstar.cash_flow_statement
is_ = morningstar.income_statement
or_ = morningstar.operation_ratios
er = morningstar.earnings_report
v = morningstar.valuation
vr = morningstar.valuation_ratios

def make_factors():
def Asset_Growth_3M():
return Returns(inputs=[bs.total_assets], window_length=63)

def Asset_To_Equity_Ratio():
return bs.total_assets.latest / bs.common_stock_equity.latest

def Capex_To_Cashflows():
return (cfs.capital_expenditure.latest * 4.) / \
(cfs.free_cash_flow.latest * 4.)

def EBITDA_Yield():
return (is_.ebitda.latest * 4.) / \
USEquityPricing.close.latest

def EBIT_To_Assets():
return (is_.ebit.latest * 4.) / \
bs.total_assets.latest

def Earnings_Quality():
return morningstar.cash_flow_statement.operating_cash_flow.latest / \
EarningsSurprises.eps_act.latest

def Return_On_Total_Invest_Capital():
return or_.roic.latest

class Mean_Reversion_1M(CustomFactor):
inputs = [Returns(window_length=21)]
window_length = 252

def compute(self, today, assets, out, monthly_rets):
out[:] = (monthly_rets[-1] - np.nanmean(monthly_rets, axis=0)) / \
np.nanstd(monthly_rets, axis=0)

class MACD_Signal_10d(CustomFactor):
inputs = [USEquityPricing.close]
window_length = 60

def compute(self, today, assets, out, close):

sig_lines = []

for col in close.T:
# get signal line only
try:
_, signal_line, _ = talib.MACD(col, fastperiod=12,
slowperiod=26, signalperiod=10)
sig_lines.append(signal_line[-1])
# if error calculating, return NaN
except:
sig_lines.append(np.nan)
out[:] = sig_lines

class Moneyflow_Volume_5d(CustomFactor):
inputs = [USEquityPricing.close, USEquityPricing.volume]
window_length = 5

def compute(self, today, assets, out, close, volume):

mfvs = []

for col_c, col_v in zip(close.T, volume.T):

# denominator
denominator = np.dot(col_c, col_v)

# numerator
numerator = 0.
for n, price in enumerate(col_c.tolist()):
if price > col_c[n - 1]:
numerator += price * col_v[n]
else:
numerator -= price * col_v[n]

mfvs.append(numerator / denominator)
out[:] = mfvs

def Net_Income_Margin():
return or_.net_margin.latest

def Operating_Cashflows_To_Assets():
return (cfs.operating_cash_flow.latest * 4.) / \
bs.total_assets.latest

def Price_Momentum_3M():
return Returns(window_length=63)

class Price_Oscillator(CustomFactor):
inputs = [USEquityPricing.close]
window_length = 252

def compute(self, today, assets, out, close):
four_week_period = close[-20:]
out[:] = (np.nanmean(four_week_period, axis=0) /
np.nanmean(close, axis=0)) - 1.

def Returns_39W():
return Returns(window_length=215)

class Trendline(CustomFactor):
inputs = [USEquityPricing.close]
window_length = 252

# using MLE for speed
def compute(self, today, assets, out, close):

# prepare X matrix (x_is - x_bar)
X = range(self.window_length)
X_bar = np.nanmean(X)
X_vector = X - X_bar
X_matrix = np.tile(X_vector, (len(close.T), 1)).T

# prepare Y matrix (y_is - y_bar)
Y_bar = np.nanmean(close, axis=0)
Y_bars = np.tile(Y_bar, (self.window_length, 1))
Y_matrix = close - Y_bars

# prepare variance of X
X_var = np.nanvar(X)

# multiply X matrix an Y matrix and sum (dot product)
# then divide by variance of X
# this gives the MLE of Beta
out[:] = (np.sum((X_matrix * Y_matrix), axis=0) / X_var) / \
(self.window_length)

class Vol_3M(CustomFactor):
inputs = [Returns(window_length=2)]
window_length = 63

def compute(self, today, assets, out, rets):
out[:] = np.nanstd(rets, axis=0)

def Working_Capital_To_Assets():
return bs.working_capital.latest / bs.total_assets.latest

def predicted_five_day_log_return():
return precog_top_100.predicted_five_day_log_return.latest

def bull_minus_bear():
return stocktwits_free.bull_minus_bear.latest

all_factors = {
#'Asset Growth 3M': Asset_Growth_3M,
#'Asset to Equity Ratio': Asset_To_Equity_Ratio,
#'Capex to Cashflows': Capex_To_Cashflows,
#'EBIT to Assets': EBIT_To_Assets,
'EBITDA Yield': EBITDA_Yield,
#'Earnings Quality': Earnings_Quality,
#'MACD Signal Line': MACD_Signal_10d,
#'Mean Reversion 1M': Mean_Reversion_1M,
#'Moneyflow Volume 5D': Moneyflow_Volume_5d,
#'Net Income Margin': Net_Income_Margin,
#'Operating Cashflows to Assets': Operating_Cashflows_To_Assets,
#'Price Momentum 3M': Price_Momentum_3M,
#'Price Oscillator': Price_Oscillator,
'Return on Invest Capital': Return_On_Total_Invest_Capital,
#'39 Week Returns': Returns_39W,
#'Trendline': Trendline,
#'Vol 3M': Vol_3M,
#'Working Capital to Assets': Working_Capital_To_Assets,
'predicted_five_day_log_return':predicted_five_day_log_return,
'bull_minus_bear':bull_minus_bear,
}

return all_factors


## Define universe and select factors to use¶

We will screen our universe using the new Q1500US and hand-pick a few alphas from the list above. We encourage you to play around with the factors.

In [3]:
universe = Q1500US() & (stocktwits_free.total_scanned_messages.latest>50) & (precog_top_100.predicted_five_day_log_return.latest.notnan())

factors = make_factors()


## Define and build the pipeline¶

Next we have to build the pipeline. In addition to the factors defined above, we need the forward returns we want to predict. In this Notebook we will predict 5-day returns and train our model on daily data. You can also subsample the data to e.g. weekly to not have overlapping return periods but we omit this here.

In [4]:
n_fwd_days = 5 # number of days to compute returns over

In [5]:
def make_history_pipeline(factors, universe, n_fwd_days=5):
# Call .rank() on all factors and mask out the universe
factor_ranks = {name: f().rank(mask=universe) for name, f in factors.iteritems()}
# Get cumulative returns over last n_fwd_days days. We will later shift these.
factor_ranks['Returns'] = Returns(inputs=[USEquityPricing.open],

pipe = Pipeline(screen=universe, columns=factor_ranks)

return pipe

In [6]:
history_pipe = make_history_pipeline(factors, universe, n_fwd_days=n_fwd_days)


## Run the pipeline¶

In [7]:
start_timer = time()
start = pd.Timestamp("2016-03-06")
end = pd.Timestamp("2016-09-14")
results = run_pipeline(history_pipe, start_date=start, end_date=end)
results.index.names = ['date', 'security']
end_timer = time()

In [147]:
start_timer = time()
dates = [
("2013-01-01","2013-07-05"),
("2013-07-06","2013-12-31"),
("2014-01-01","2014-06-30"),
("2014-07-01","2014-12-31"),
("2015-01-01","2015-06-30"),
("2015-07-01","2015-12-31"),
("2016-01-01","2016-06-30"),
("2016-07-01","2016-09-30"),]

for date in dates:
start = pd.Timestamp(date[0])
end = pd.Timestamp(date[1])
if date == dates[0]:
results = run_pipeline(history_pipe, start_date=start, end_date=end)
else:
results = results.append(run_pipeline(history_pipe, start_date=start, end_date=end))
results.index.names = ['date', 'security']
print 'requested dates ', date
end_timer = time()

requested dates  ('2013-01-01', '2013-07-05')
requested dates  ('2013-07-06', '2013-12-31')
requested dates  ('2014-01-01', '2014-06-30')
requested dates  ('2014-07-01', '2014-12-31')
requested dates  ('2015-01-01', '2015-06-30')
requested dates  ('2015-07-01', '2015-12-31')
requested dates  ('2016-01-01', '2016-06-30')
requested dates  ('2016-07-01', '2016-09-30')

In [8]:
print "Time to run pipeline %.2f secs" % (end_timer - start_timer)

Time to run pipeline 67.11 secs

In [9]:
results.head()

Out[9]:
EBITDA Yield Return on Invest Capital Returns bull_minus_bear predicted_five_day_log_return
date security
2016-03-07 00:00:00+00:00 Equity(24 [AAPL]) 2.0 2.0 0.056886 1.0 1.0
Equity(42950 [FB]) 1.0 1.0 0.022770 2.0 2.0
2016-03-08 00:00:00+00:00 Equity(24 [AAPL]) 11.0 12.0 0.048541 7.0 11.0
Equity(700 [BAC]) 13.0 3.0 0.068137 12.0 13.0
Equity(1267 [CAT]) 4.0 2.0 0.063705 10.0 7.0
In [10]:
results.tail()

Out[10]:
EBITDA Yield Return on Invest Capital Returns bull_minus_bear predicted_five_day_log_return
date security
2016-09-14 00:00:00+00:00 Equity(8151 [WFC]) 10.0 4.0 -0.039968 2.0 4.0
Equity(13197 [FCX]) 8.0 2.0 -0.005602 6.0 2.0
Equity(16841 [AMZN]) 1.0 7.0 -0.026294 9.0 6.0
Equity(26578 [GOOG_L]) 3.0 6.0 -0.017229 1.0 8.0
Equity(42950 [FB]) 4.0 9.0 -0.015449 7.0 11.0
In [11]:
#results.loc(axis=0)[:,24][['Returns','close']]

In [12]:
#for i,a in enumerate(results.index.duplicated()):
#    if a == True:
#        print i

In [13]:
results.shape
#len(results[results.index != 657])

Out[13]:
(1716, 5)

As you can see, running pipeline gives us factors for every day and every security, ranked relative to each other. We assume that the order of individual factors might carry some weak predictive power on future returns. The question then becomes: how can we combine these weakly predictive factors in a clever way to get a single mega-alpha which is hopefully more predictive.

This is an important milestone. We have our ranked factor values on each day for each stock. Ranking is not absolutely necessary but has several benefits:

• it increases robustness to outliers,
• we mostly care about the relative ordering rather than the absolute values.

Also note the Returns column. These are the values we want to predict given the factor ranks.

Next, we are doing some additional transformations to our data:

1. Shift factor ranks to align with future returns n_fwd_days days in the future.
2. Find the top and bottom 30 percentile stocks by their returns. Essentially, we only care about relative movement of stocks. If we later short stocks that go down and long stocks going up relative to each other, it doesn't matter if e.g. all stocks are going down in absolute terms. Moverover, we are ignoring stocks that did not move that much (i.e. 30th to 70th percentile) to only train the classifier on those that provided strong signal.
3. We also binarize the returns by their percentile to turn our ML problem into a classification one.

shift_mask_data() is a utility function that does all of these.

In [14]:
def shift_mask_data(X, Y, upper_percentile=90, lower_percentile=10, n_fwd_days=1):
# Shift X to match factors at t to returns at t+n_fwd_days (we want to predict future returns after all)
shifted_X = np.roll(X, n_fwd_days, axis=0)

# Slice off rolled elements
X = shifted_X[n_fwd_days:]
Y = Y[n_fwd_days:]

n_time, n_stocks, n_factors = X.shape

# Look for biggest up and down movers
upper = np.nanpercentile(Y, upper_percentile, axis=1)[:, np.newaxis]
lower = np.nanpercentile(Y, lower_percentile, axis=1)[:, np.newaxis]

# Only try to predict whether a stock moved up/down relative to other stocks
Y_binary = np.zeros(n_time * n_stocks)

# Flatten X
X = X.reshape((n_time * n_stocks, n_factors))

# Drop stocks that did not move much (i.e. are in the 30th to 70th percentile)

return X, Y_binary


After we have our helper function to align our data properly we pass our factor ranks to it. You might wonder why we have to do the swapaxes thing below rather than just using pandas logic. The reason is that this way we can use the same shift_mask_data() function inside of a factor where we do not have access to a Pandas DataFrame. More on that in a future notebook.

In [15]:
# Massage data to be in the form expected by shift_mask_data()
results_wo_returns = results.copy()
returns = results_wo_returns.pop('Returns')
Y = returns.unstack().values
X = results_wo_returns.to_panel()
X = X.swapaxes(2, 0).swapaxes(0, 1).values # (factors, time, stocks) -> (time, stocks, factors)


Next, we split our data into training (80%) and test (20%). This is common practice: our classifier will try to fit the training set as well as possible but it does not tell us how well it would perform on unseen data. Because we are dealing with time-series data we split along the time-dimension to only test on future data.

In [16]:
# Train-test split
train_size_perc = 0.8
n_time, n_stocks, n_factors = X.shape
train_size = np.int16(np.round(train_size_perc * n_time))
X_train, Y_train = X[:train_size, ...], Y[:train_size]
X_test, Y_test = X[(train_size+n_fwd_days):, ...], Y[(train_size+n_fwd_days):]

In [17]:
X.shape, X_train.shape, X_test.shape

Out[17]:
((134, 99, 4), (107, 99, 4), (22, 99, 4))

As we can only exclude stocks that did not move by a lot (i.e. 30th to 70th percentile) during training, we keep all stocks in our test set and just binarize according to the median. This avoids look-ahead bias.

In [18]:
X_train_shift, Y_train_shift = shift_mask_data(X_train, Y_train, n_fwd_days=n_fwd_days)
X_test_shift, Y_test_shift = shift_mask_data(X_test, Y_test, n_fwd_days=n_fwd_days,
lower_percentile=50,
upper_percentile=50)

In [19]:
X_train_shift.shape, X_test_shift.shape

Out[19]:
((380, 4), (175, 4))
In [20]:
Y_train_shift.shape

Out[20]:
(380,)

## Training our ML pipeline¶

Before training our classifier, several preprocessing steps are advisable. The first one imputes nan values with the factor mean to get clean training data, the second scales our factor ranks to be between [0, 1).

For training we are using the AdaBoost classifier which automatically determines the most relevant features (factors) and tries to find a non-linear combination of features to maximize predictiveness while still being robust. In essence, AdaBoost trains an ensemble of weak classifiers (decision trees in this case) sequentially. Each subsequent weak classifier takes into account the samples (or data points) already classified by the previous weak classifiers. It then focuses on the samples misclassified by the previous weak classifiers and tries to get those correctly. With each new weak classifier you get more fine-grained in your decision function and correctly classify some previously misclassified samples. For prediction, you simply average the answer of all weak classifiers to get a single strong classifier.

Of course, this is just an example and you can let your creativity and skill roam freely.

In [21]:
start_timer = time()

# Train classifier
imputer = preprocessing.Imputer()
scaler = preprocessing.MinMaxScaler()
#clf = ensemble.AdaBoostClassifier(n_estimators=150) # n_estimators controls how many weak classifiers are fi

#from sklearn.linear_model import LogisticRegression
#clf = LogisticRegression(C = 1, tol=0.01)

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

#from sklearn import svm
#clf = svm.SVC(probability=True)

X_train_trans = imputer.fit_transform(X_train_shift)
X_train_trans = scaler.fit_transform(X_train_trans)
clf.fit(X_train_trans, Y_train_shift)

end_timer = time()

In [22]:
print "Time to train full ML pipline: %0.2f secs" % (end_timer - start_timer)

Time to train full ML pipline: 0.00 secs


As you can see, training a modern ML classifier does not have to be very compute intensive. Scikit-learn is heavily optimized so the full process only takes less than 10 seconds. Of course, things like deep-learning (which is currently not available on Quantopian), might take a bit longer, but these models are also trained on data sets much much bigger than this (a famous subset of the ImageNet data set is 138 GB).

This means that the current bottlenecks are retrieving the data from pipeline (RAM and i/o), not lack of GPU or parallel processing support.

In [23]:
Y_pred = clf.predict(X_train_trans)
print('Accuracy on train set = {:.2f}%'.format(metrics.accuracy_score(Y_train_shift, Y_pred) * 100))

Accuracy on train set = 55.53%


The classifier does reasonably well on the data we trained it on, but the real test is on hold-out data.

Exercise: It is also common to run cross-validation on the training data and tweak the parameters based on that score, testing should only be done rarely. Try coding a sklearn pipeline with K-Fold cross-validation.

## Evaluating our ML classifier¶

To evaluate our ML classifer on the test set, we have to transform our test data in the same way as our traning data. Note that we are only calling the .transform() method here which does not use any information from the test set.

In [24]:
# Transform test data
X_test_trans = imputer.transform(X_test_shift)
X_test_trans = scaler.transform(X_test_trans)


After all this work, we can finally test our classifier. We can predict binary labels but also get probability estimates.

In [25]:
# Predict!
Y_pred = clf.predict(X_test_trans)

Y_pred_prob = clf.predict_proba(X_test_trans)

In [26]:
print 'Predictions:', Y_pred
print 'Probabilities of class == 1:', Y_pred_prob[:, 1] * 100

Predictions: [ 1.  1.  1.  1.  1. -1. -1.  1. -1. -1.  1.  1.  1. -1. -1. -1. -1. -1.
-1. -1.  1.  1. -1. -1. -1.  1. -1.  1. -1. -1.  1. -1.  1.  1.  1.  1.
-1.  1. -1. -1. -1.  1.  1.  1. -1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
-1. -1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1. -1.  1.  1. -1.
-1.  1.  1.  1.  1.  1. -1. -1. -1. -1.  1.  1.  1. -1.  1.  1.  1.  1.
-1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1. -1. -1.  1.
-1.  1.  1.  1.  1.  1. -1.  1.  1.  1.  1.  1.  1. -1.  1.  1.  1.  1.
1.  1.  1.  1. -1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
Probabilities of class == 1: [ 64.15854632  60.74011468  60.74011468  60.74011468  67.19596022
29.35105656  41.49067948  60.74011468  48.51738873  27.73089413
60.74011468  60.74011468  57.77106774  47.62712594  42.06789279
5.34561024  32.71880583   8.1820819    9.72525106  17.91873318
60.74011468  60.74011468  48.3008313   30.07017037   6.74356966
58.92648816  45.53962611  50.61840852  47.71842546  40.080659
60.74011468   4.41591747  60.74011468  60.74011468  60.74011468
60.74011468  32.52593504  60.74011468  22.26633784  15.89878195
26.70107047  58.64216684  60.74011468  60.74011468  47.85989399
60.74011468  59.94519654  60.74011468  50.64398317  60.74011468
60.74011468  60.74011468  51.7398981   60.74011468  60.74011468
64.92526874  60.74011468  61.13567043  65.60430074  69.18707182
60.34575055  60.74011468  62.53689856  65.44792782  59.07503332
60.74011468  68.29926204  60.74011468  60.74011468  67.11383405
57.5383125   56.40009469  45.73845661  48.18697445  60.74011468
60.74011468  53.42919124  60.74011468  66.58579517  60.74011468
60.74011468  60.74011468  68.50327028  60.74011468  64.58222403
56.94419025  40.25559669  57.2007305   61.41333862  34.15506734
34.10924298  58.44994317  67.710483    60.74011468  60.74011468
66.28737009  49.49302272  38.97819318  17.72502033  16.73660097
60.74011468  60.74011468  60.74011468  28.99103363  57.43745303
50.1647689   56.33349637  60.74011468  35.82882408  65.05211773
65.24060007  61.48714814  54.03878839  61.74511182  60.74011468
56.1387466   60.74011468  60.74011468  68.84125368  62.60381098
65.07986885  54.02014565  60.74011468  34.3264257   44.39014867
60.74011468  34.60926752  60.74011468  60.74011468  60.74011468
60.74011468  58.66060433  45.2880617   54.30735803  58.99356798
60.74011468  61.14525482  56.77115474  61.99810919  46.65575086
60.74011468  60.74011468  60.74011468  60.74011468  55.40321685
60.74011468  68.15442596  57.28522267  48.72691917  64.81080433
60.56917734  63.60749938  60.74011468  60.74011468  60.74011468
60.74011468  60.74011468  60.74011468  60.74011468  60.74011468
60.74011468  60.74011468  60.74011468  66.65437648  64.52434044
60.74011468  60.74011468  53.7878956   61.90463837  57.66106124
60.74011468  60.74011468  68.23248223  59.19813372  66.6522005 ]

In [27]:
verg = pd.DataFrame({'orig':Y_test_shift,'pred':Y_pred,'prob':Y_pred_prob[:, 1] * 100})

In [28]:
#verg


There are many ways to evaluate the performance of our classifier. The simplest and most intuitive one is certainly the accuracy (50% is chance due to our median split). On Kaggle competitions, you will also often find the log-loss being used. This punishes you for being wrong and confident in your answer. See the Kaggle description for more motivation.

In [29]:
print('Accuracy on test set = {:.2f}%'.format(metrics.accuracy_score(Y_test_shift, Y_pred) * 100))
print('Log-loss = {:.5f}'.format(metrics.log_loss(Y_test_shift, Y_pred_prob)))

Accuracy on test set = 50.86%
Log-loss = 0.72092


The following would be done by a pipline, where one factor is the probability

In [30]:
perc_verg = verg[verg.prob >= np.nanpercentile(verg['prob'],90)].append(verg[verg.prob <= np.nanpercentile(verg['prob'],10)])

In [31]:
print('Accuracy on test set with selected percentiles = {:.2f}%'.format(metrics.accuracy_score(perc_verg['orig'], perc_verg['pred']) * 100))
print('Number of trades in selected test set: ' + str(len(perc_verg)))

Accuracy on test set with selected percentiles = 61.11%
Number of trades in selected test set: 36


Seems like we're at chance on this data set, alas. But perhaps you can do better?

We can also examine which factors the classifier identified as most predictive.

In [74]:
feature_importances = pd.Series(clf.feature_importances_, index=results_wo_returns.columns)
feature_importances.sort(ascending=False)
ax = feature_importances.plot(kind='bar')
ax.set(ylabel='Importance (Gini Coefficient)', title='Feature importances');

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-74-041c062a1783> in <module>()
----> 1 feature_importances = pd.Series(clf.feature_importances_, index=results_wo_returns.columns)
2 feature_importances.sort(ascending=False)
3 ax = feature_importances.plot(kind='bar')
4 ax.set(ylabel='Importance (Gini Coefficient)', title='Feature importances');

AttributeError: 'GaussianNB' object has no attribute 'feature_importances_'

Exercise: Use partial dependence plots to get an understanding of how factor rankings are used to predict future returns.

## Where to go from here¶

Several knobs can be tweaked to boost performance:

## Machine Learning competition¶

If you have something you think works well, post it in this thread. Make sure to test over the same time-period as I have here to keep things comparable. In a month from now, we can test on new data that has aggregated since then and determine who built the best ML pipeline. If there is demand, we might turn this into a proper ML contest.

## Machine Learning resources¶

If you look for information on how to get started with ML, here are a few resources:

## How to put this into an algorithm¶

As mentioned above, this is not immediately usable in an algorithm. For one thing, there is no run_pipeline() in the backtest IDE. It turns out to be rather simple to take the code above and put it into a pipeline CustomFactor() where the ML model would automatically get retrained and make predictions. You would then long the 1 predictions and short the -1 predictions, apply some weighting (e.g. inverse variance) and execute orders. More on these next steps in the future.

## Credits¶

• Content created by James Christopher and Thomas Wiecki
• Thanks to Sheng Wang for ideas and inspiration.
• Jonathan Larkin, Jess Stauth, Delaney Granizo-Mackenzie, and Jean Bredeche for helpful comments on an earlier draft.
In [ ]: