Recently, Quantopian's Chief Investment Officer, Jonathan Larkin, shared an industry insider's overview of the professional quant equity workflow. This workflow comprises five distinct stages: (1) Universe Definition, (2) Alpha Discovery, (3) Alpha Combination, (4) Portfolio Construction, and (5) Trading.

This Notebook focuses on stage 3: Alpha Combination. At this stage, Machine Learning is an intuitive choice: we have abstracted the problem to the point where it is a classic classification (or regression) problem, which ML is very good at solving, and the fitted model itself provides a predictive alpha combination.

As you will see, a lot of the code here sets up a factor library and wrangles the data into shape. Those details are perhaps not as interesting, so feel free to skip directly to "Training our ML pipeline", where everything is in place to train and test our classifier.

- Define trading universe to use (Q500US and Q1500US).
- Define alphas (implemented in Pipeline).
- Run pipeline.
- Split into train and test set.
- Preprocess data (rank alphas, subsample, align alphas with future returns, impute, scale).
- Train Machine Learning classifier (AdaBoost from Scikit-Learn).
- Evaluate Machine Learning classifier on test set.

Note one important limitation: we only train and test on static (i.e. fixed-in-time) data, so you cannot directly do the same in an algorithm. In principle this is possible and will be the next step, but it makes sense to first focus on the ML itself in a more direct way, to build a good intuition about the workflow and how to develop a competitive ML pipeline.

This workflow is still a bit rough around the edges. We are working on improving it and adding better educational materials. This serves as a sneak-peek for the curious and adventurous.

In [1]:

```
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Latest
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.psychsignal import stocktwits_free
from quantopian.pipeline.data.logitbot import precog_top_100
from quantopian.pipeline.factors import CustomFactor, SimpleMovingAverage, AverageDollarVolume, Returns, RSI
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.filters import Q500US, Q1500US
from quantopian.pipeline.data.quandl import fred_usdontd156n as libor
from quantopian.pipeline.data.zacks import EarningsSurprises
import talib
import pandas as pd
import numpy as np
from time import time
import alphalens as al
import pyfolio as pf
from scipy import stats
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, ensemble, preprocessing, isotonic, metrics
```

The factors below are a small collection of commonly used alphas that were coded by Gil Wassermann. I will post a separate Notebook with the full collection and more descriptions of them. Ultimately we will put these into a library you can just import to avoid the wall of text. If you want to understand more about pipeline, read the tutorial.

Also note the `Earnings_Quality` alpha, which uses Zacks Earnings Surprises, a new data source from our partners.

The details of these factors are not the focus of this Notebook so feel free to just skip this cell.

In [2]:

```
bs = morningstar.balance_sheet
cfs = morningstar.cash_flow_statement
is_ = morningstar.income_statement
or_ = morningstar.operation_ratios
er = morningstar.earnings_report
v = morningstar.valuation
vr = morningstar.valuation_ratios


def make_factors():
    def Asset_Growth_3M():
        return Returns(inputs=[bs.total_assets], window_length=63)

    def Asset_To_Equity_Ratio():
        return bs.total_assets.latest / bs.common_stock_equity.latest

    def Capex_To_Cashflows():
        return (cfs.capital_expenditure.latest * 4.) / \
            (cfs.free_cash_flow.latest * 4.)

    def EBITDA_Yield():
        return (is_.ebitda.latest * 4.) / \
            USEquityPricing.close.latest

    def EBIT_To_Assets():
        return (is_.ebit.latest * 4.) / \
            bs.total_assets.latest

    def Earnings_Quality():
        return morningstar.cash_flow_statement.operating_cash_flow.latest / \
            EarningsSurprises.eps_act.latest

    def Return_On_Total_Invest_Capital():
        return or_.roic.latest

    class Mean_Reversion_1M(CustomFactor):
        inputs = [Returns(window_length=21)]
        window_length = 252

        def compute(self, today, assets, out, monthly_rets):
            out[:] = (monthly_rets[-1] - np.nanmean(monthly_rets, axis=0)) / \
                np.nanstd(monthly_rets, axis=0)

    class MACD_Signal_10d(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 60

        def compute(self, today, assets, out, close):
            sig_lines = []
            for col in close.T:
                # get signal line only
                try:
                    _, signal_line, _ = talib.MACD(col, fastperiod=12,
                                                   slowperiod=26, signalperiod=10)
                    sig_lines.append(signal_line[-1])
                # if error calculating, return NaN
                except:
                    sig_lines.append(np.nan)
            out[:] = sig_lines

    class Moneyflow_Volume_5d(CustomFactor):
        inputs = [USEquityPricing.close, USEquityPricing.volume]
        window_length = 5

        def compute(self, today, assets, out, close, volume):
            mfvs = []
            for col_c, col_v in zip(close.T, volume.T):
                # denominator
                denominator = np.dot(col_c, col_v)
                # numerator
                numerator = 0.
                for n, price in enumerate(col_c.tolist()):
                    if price > col_c[n - 1]:
                        numerator += price * col_v[n]
                    else:
                        numerator -= price * col_v[n]
                mfvs.append(numerator / denominator)
            out[:] = mfvs

    def Net_Income_Margin():
        return or_.net_margin.latest

    def Operating_Cashflows_To_Assets():
        return (cfs.operating_cash_flow.latest * 4.) / \
            bs.total_assets.latest

    def Price_Momentum_3M():
        return Returns(window_length=63)

    class Price_Oscillator(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 252

        def compute(self, today, assets, out, close):
            four_week_period = close[-20:]
            out[:] = (np.nanmean(four_week_period, axis=0) /
                      np.nanmean(close, axis=0)) - 1.

    def Returns_39W():
        return Returns(window_length=215)

    class Trendline(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 252

        # using MLE for speed
        def compute(self, today, assets, out, close):
            # prepare X matrix (x_is - x_bar)
            X = range(self.window_length)
            X_bar = np.nanmean(X)
            X_vector = X - X_bar
            X_matrix = np.tile(X_vector, (len(close.T), 1)).T

            # prepare Y matrix (y_is - y_bar)
            Y_bar = np.nanmean(close, axis=0)
            Y_bars = np.tile(Y_bar, (self.window_length, 1))
            Y_matrix = close - Y_bars

            # prepare variance of X
            X_var = np.nanvar(X)

            # multiply X matrix and Y matrix and sum (dot product),
            # then divide by variance of X; this gives the MLE of beta
            out[:] = (np.sum((X_matrix * Y_matrix), axis=0) / X_var) / \
                (self.window_length)

    class Vol_3M(CustomFactor):
        inputs = [Returns(window_length=2)]
        window_length = 63

        def compute(self, today, assets, out, rets):
            out[:] = np.nanstd(rets, axis=0)

    def Working_Capital_To_Assets():
        return bs.working_capital.latest / bs.total_assets.latest

    def predicted_five_day_log_return():
        return precog_top_100.predicted_five_day_log_return.latest

    def bull_minus_bear():
        return stocktwits_free.bull_minus_bear.latest

    all_factors = {
        #'Asset Growth 3M': Asset_Growth_3M,
        #'Asset to Equity Ratio': Asset_To_Equity_Ratio,
        #'Capex to Cashflows': Capex_To_Cashflows,
        #'EBIT to Assets': EBIT_To_Assets,
        'EBITDA Yield': EBITDA_Yield,
        #'Earnings Quality': Earnings_Quality,
        #'MACD Signal Line': MACD_Signal_10d,
        #'Mean Reversion 1M': Mean_Reversion_1M,
        #'Moneyflow Volume 5D': Moneyflow_Volume_5d,
        #'Net Income Margin': Net_Income_Margin,
        #'Operating Cashflows to Assets': Operating_Cashflows_To_Assets,
        #'Price Momentum 3M': Price_Momentum_3M,
        #'Price Oscillator': Price_Oscillator,
        'Return on Invest Capital': Return_On_Total_Invest_Capital,
        #'39 Week Returns': Returns_39W,
        #'Trendline': Trendline,
        #'Vol 3M': Vol_3M,
        #'Working Capital to Assets': Working_Capital_To_Assets,
        'predicted_five_day_log_return': predicted_five_day_log_return,
        'bull_minus_bear': bull_minus_bear,
    }

    return all_factors
```

We will screen our universe using the new Q1500US and hand-pick a few alphas from the list above. We encourage you to play around with the factors.

In [3]:

```
universe = (Q1500US()
            & (stocktwits_free.total_scanned_messages.latest > 50)
            & (precog_top_100.predicted_five_day_log_return.latest.notnan()))
factors = make_factors()
```

Next we have to build the pipeline. In addition to the factors defined above, we need the forward returns we want to predict. In this Notebook we will predict 5-day returns and train our model on daily data. You could also subsample the data (e.g. to weekly) to avoid overlapping return periods, but we omit this here.

In [4]:

```
n_fwd_days = 5 # number of days to compute returns over
```

In [5]:

```
def make_history_pipeline(factors, universe, n_fwd_days=5):
    # Call .rank() on all factors and mask out the universe
    factor_ranks = {name: f().rank(mask=universe) for name, f in factors.iteritems()}

    # Get cumulative returns over last n_fwd_days days. We will later shift these.
    factor_ranks['Returns'] = Returns(inputs=[USEquityPricing.open],
                                      mask=universe, window_length=n_fwd_days)
    #factor_ranks['close'] = Latest([USEquityPricing.close], mask=universe)

    pipe = Pipeline(screen=universe, columns=factor_ranks)
    return pipe
```

In [6]:

```
history_pipe = make_history_pipeline(factors, universe, n_fwd_days=n_fwd_days)
```

In [7]:

```
start_timer = time()
start = pd.Timestamp("2016-03-06")
end = pd.Timestamp("2016-09-14")
results = run_pipeline(history_pipe, start_date=start, end_date=end)
results.index.names = ['date', 'security']
end_timer = time()
```

In [147]:

```
start_timer = time()

dates = [
    ("2013-01-01", "2013-07-05"),
    ("2013-07-06", "2013-12-31"),
    ("2014-01-01", "2014-06-30"),
    ("2014-07-01", "2014-12-31"),
    ("2015-01-01", "2015-06-30"),
    ("2015-07-01", "2015-12-31"),
    ("2016-01-01", "2016-06-30"),
    ("2016-07-01", "2016-09-30"),
]

# Run the pipeline in half-year chunks and concatenate the results.
for date in dates:
    start = pd.Timestamp(date[0])
    end = pd.Timestamp(date[1])
    if date == dates[0]:
        results = run_pipeline(history_pipe, start_date=start, end_date=end)
    else:
        results = results.append(run_pipeline(history_pipe, start_date=start, end_date=end))
    print 'requested dates ', date

results.index.names = ['date', 'security']
end_timer = time()
```

In [8]:

```
print "Time to run pipeline %.2f secs" % (end_timer - start_timer)
```

In [9]:

```
results.head()
```

Out[9]:

In [10]:

```
results.tail()
```

Out[10]:

In [11]:

```
#results.loc(axis=0)[:,24][['Returns','close']]
```

In [12]:

```
#for i, a in enumerate(results.index.duplicated()):
#    if a == True:
#        print i
```

In [13]:

```
results.shape
#len(results[results.index != 657])
```

Out[13]:

This is an important milestone. We have our ranked factor values on each day for each stock. Ranking is not absolutely necessary but has several benefits:

- it increases robustness to outliers,
- we mostly care about the relative ordering of stocks rather than their absolute values.

Also note the `Returns` column: these are the values we want to predict given the factor ranks.
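The robustness benefit of ranking can be illustrated with a toy sketch (made-up factor values, using `scipy.stats.rankdata`): a single extreme outlier changes the raw values drastically but leaves the ranks untouched.

```python
import numpy as np
from scipy import stats

# Two versions of a factor: identical except for one extreme outlier.
factor = np.array([0.1, 0.5, 0.2, 0.9, 0.3])
factor_outlier = np.array([0.1, 0.5, 0.2, 100.0, 0.3])  # 0.9 -> 100.0

# The raw values change drastically ...
print(np.max(np.abs(factor - factor_outlier)))  # 99.1

# ... but the ranks are identical: only the ordering matters.
print(stats.rankdata(factor))          # [1. 4. 2. 5. 3.]
print(stats.rankdata(factor_outlier))  # [1. 4. 2. 5. 3.]
```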

Next, we apply some additional transformations to our data:

- Shift the factor ranks to align with the future returns `n_fwd_days` days ahead.
- Find the top and bottom 30th-percentile stocks by their returns. Essentially, we only care about the relative movement of stocks: if we later short stocks that go down and long stocks that go up relative to each other, it does not matter if, say, all stocks are falling in absolute terms. Moreover, we ignore stocks that did not move much (i.e. those between the 30th and 70th percentiles) so that the classifier is trained only on those that provided a strong signal.
- Binarize the returns by their percentile to turn our ML problem into a classification one.

`shift_mask_data()` is a utility function that does all of these.

In [14]:

```
def shift_mask_data(X, Y, upper_percentile=90, lower_percentile=10, n_fwd_days=1):
    # Shift X to match factors at t to returns at t+n_fwd_days
    # (we want to predict future returns after all)
    shifted_X = np.roll(X, n_fwd_days, axis=0)

    # Slice off rolled elements
    X = shifted_X[n_fwd_days:]
    Y = Y[n_fwd_days:]

    n_time, n_stocks, n_factors = X.shape

    # Look for biggest up and down movers
    upper = np.nanpercentile(Y, upper_percentile, axis=1)[:, np.newaxis]
    lower = np.nanpercentile(Y, lower_percentile, axis=1)[:, np.newaxis]

    upper_mask = (Y >= upper)
    lower_mask = (Y <= lower)

    mask = upper_mask | lower_mask  # This also drops nans
    mask = mask.flatten()

    # Only try to predict whether a stock moved up/down relative to other stocks
    Y_binary = np.zeros(n_time * n_stocks)
    Y_binary[upper_mask.flatten()] = 1
    Y_binary[lower_mask.flatten()] = -1

    # Flatten X
    X = X.reshape((n_time * n_stocks, n_factors))

    # Drop stocks that did not move much (i.e. are in the 30th to 70th percentile)
    X = X[mask]
    Y_binary = Y_binary[mask]

    return X, Y_binary
```
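The core shift-and-slice alignment can be illustrated on toy 1-D data (made-up values): `np.roll` pushes the factor values forward in time, and slicing off the first `n_fwd_days` rows discards the wrapped-around entries, so each remaining factor value is paired with the return realized `n_fwd_days` days later.

```python
import numpy as np

# Toy setup: one factor value per day and the trailing n_fwd_days return
# observed on each day (the 'Returns' pipeline column).
n_fwd_days = 2
X = np.arange(6.0)       # factor value on days 0..5
Y = np.arange(6.0) * 10  # return observed on days 0..5

# Shift factors forward and slice off the rolled-in rows so that the
# factor from day t lines up with the return known at day t + n_fwd_days.
shifted_X = np.roll(X, n_fwd_days, axis=0)
X_aligned = shifted_X[n_fwd_days:]
Y_aligned = Y[n_fwd_days:]

print(X_aligned)  # [0. 1. 2. 3.]      -> factors from days 0..3
print(Y_aligned)  # [20. 30. 40. 50.]  -> returns observed on days 2..5
```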

You might wonder why we do the `swapaxes` gymnastics below rather than just using `pandas` logic. The reason is that this way we can reuse the same `shift_mask_data()` function inside of a factor, where we do not have access to a Pandas `DataFrame`. More on that in a future notebook.

In [15]:

```
# Massage data to be in the form expected by shift_mask_data()
results_wo_returns = results.copy()
returns = results_wo_returns.pop('Returns')
Y = returns.unstack().values
X = results_wo_returns.to_panel()
X = X.swapaxes(2, 0).swapaxes(0, 1).values # (factors, time, stocks) -> (time, stocks, factors)
```

In [16]:

```
# Train-test split
train_size_perc = 0.8
n_time, n_stocks, n_factors = X.shape
train_size = np.int16(np.round(train_size_perc * n_time))
X_train, Y_train = X[:train_size, ...], Y[:train_size]
X_test, Y_test = X[(train_size+n_fwd_days):, ...], Y[(train_size+n_fwd_days):]
```
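Note the `n_fwd_days` gap between the end of the training slice and the start of the test slice: each label is an `n_fwd_days` forward return, so without the gap the last training labels would overlap in time with the first test days and leak information. A toy sketch of the resulting indices:

```python
import numpy as np

n_time, n_fwd_days = 100, 5
train_size = int(round(0.8 * n_time))

time_idx = np.arange(n_time)
train_idx = time_idx[:train_size]
test_idx = time_idx[train_size + n_fwd_days:]

# The n_fwd_days gap ensures no training-period forward return overlaps
# with the test period.
print(train_idx[-1], test_idx[0])  # 79 85
```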

In [17]:

```
X.shape, X_train.shape, X_test.shape
```

Out[17]:

In [18]:

```
X_train_shift, Y_train_shift = shift_mask_data(X_train, Y_train, n_fwd_days=n_fwd_days)
X_test_shift, Y_test_shift = shift_mask_data(X_test, Y_test, n_fwd_days=n_fwd_days,
                                             lower_percentile=50,
                                             upper_percentile=50)
```

In [19]:

```
X_train_shift.shape, X_test_shift.shape
```

Out[19]:

In [20]:

```
Y_train_shift.shape
```

Out[20]:

Before training our classifier, several preprocessing steps are advisable. The first imputes NaN values with the factor mean to get clean training data; the second scales our factor ranks to lie between 0 and 1.
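A minimal standalone sketch of these two steps on made-up data (using `sklearn.impute.SimpleImputer`, the modern scikit-learn equivalent of the `preprocessing.Imputer` used in the cell below):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy factor matrix with a missing value (rows = samples, cols = factors).
X = np.array([[1.0, 3.0],
              [2.0, np.nan],
              [3.0, 1.0]])

# 1) Replace NaNs with the column (factor) mean, 2) scale each column to [0, 1].
X_imp = SimpleImputer(strategy='mean').fit_transform(X)
X_scaled = MinMaxScaler().fit_transform(X_imp)

print(X_imp)     # NaN replaced by the column mean, 2.0
print(X_scaled)  # each column now spans [0, 1]
```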

For training we are using the AdaBoost classifier, which automatically determines the most relevant features (factors) and tries to find a non-linear combination of features that maximizes predictiveness while remaining robust. In essence, AdaBoost trains an ensemble of weak classifiers (decision trees in this case) sequentially. Each subsequent weak classifier takes into account the samples already classified by its predecessors, focuses on the ones they misclassified, and tries to classify those correctly. With each new weak classifier the decision function becomes more fine-grained and correctly classifies some previously misclassified samples. For prediction, the answers of all weak classifiers are combined by a weighted vote into a single strong classifier.

Of course, this is just an example and you can let your creativity and skill roam freely.
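As a standalone sketch of the idea (on synthetic data with parameters of my choosing, independent of the notebook's pipeline), fitting an AdaBoost ensemble of decision stumps takes only a few lines:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic two-factor problem: label +1 when the factors sum to a positive
# value, -1 otherwise (mirroring the up/down labels used in the notebook).
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.where(X.sum(axis=1) > 0, 1, -1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators controls how many weak classifiers (decision stumps) are fit
# sequentially, each re-weighting the samples its predecessors got wrong.
clf = AdaBoostClassifier(n_estimators=150, random_state=0)
clf.fit(X_tr, y_tr)
print('test accuracy: %.2f' % clf.score(X_te, y_te))
```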

In [21]:

```
start_timer = time()

# Train classifier
imputer = preprocessing.Imputer()
scaler = preprocessing.MinMaxScaler()

#clf = ensemble.AdaBoostClassifier(n_estimators=150)  # n_estimators controls how many weak classifiers are fit
#from sklearn.linear_model import LogisticRegression
#clf = LogisticRegression(C=1, tol=0.01)
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
#from sklearn import svm
#clf = svm.SVC(probability=True)

X_train_trans = imputer.fit_transform(X_train_shift)
X_train_trans = scaler.fit_transform(X_train_trans)
clf.fit(X_train_trans, Y_train_shift)

end_timer = time()
```

In [22]:

```
print "Time to train full ML pipeline: %0.2f secs" % (end_timer - start_timer)
```

As you can see, training a modern ML classifier does not have to be very compute-intensive. Scikit-learn is heavily optimized, so the full process takes less than 10 seconds. Of course, things like deep learning (currently not available on Quantopian) might take longer, but those models are also trained on data sets far bigger than this one (a famous subset of the ImageNet data set is 138 GB).

This means the current bottleneck is retrieving the data from pipeline (RAM and I/O), not a lack of GPU or parallel-processing support.

In [23]:

```
Y_pred = clf.predict(X_train_trans)
print('Accuracy on train set = {:.2f}%'.format(metrics.accuracy_score(Y_train_shift, Y_pred) * 100))
```

*Exercise*: It is also common to run cross-validation on the training data and tweak the parameters based on that score; evaluating on the test set should only be done rarely. Try coding an sklearn pipeline with K-fold cross-validation.

To evaluate our ML classifier on the test set, we have to transform our test data in the same way as our training data. Note that we only call the `.transform()` method here, which does not use any information from the test set.

In [24]:

```
# Transform test data
X_test_trans = imputer.transform(X_test_shift)
X_test_trans = scaler.transform(X_test_trans)
```

In [25]:

```
# Predict!
Y_pred = clf.predict(X_test_trans)
Y_pred_prob = clf.predict_proba(X_test_trans)
```

In [26]:

```
print 'Predictions:', Y_pred
print 'Probabilities of class == 1:', Y_pred_prob[:, 1] * 100
```

In [27]:

```
verg = pd.DataFrame({'orig':Y_test_shift,'pred':Y_pred,'prob':Y_pred_prob[:, 1] * 100})
```

In [28]:

```
#verg
```

In addition to accuracy, we report the log-loss below, which takes the predicted probabilities into account: to score well, you must not only make the right call but also be *confident* in your answer. See the Kaggle description for more motivation.

In [29]:

```
print('Accuracy on test set = {:.2f}%'.format(metrics.accuracy_score(Y_test_shift, Y_pred) * 100))
print('Log-loss = {:.5f}'.format(metrics.log_loss(Y_test_shift, Y_pred_prob)))
```
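To make the confidence point concrete, here is a toy sketch (made-up labels and probabilities) comparing two classifiers that make exactly the same hard predictions but with different confidence:

```python
from sklearn.metrics import log_loss

y_true = [1, 1, 0, 0]

# Both probability vectors imply the same hard predictions (threshold 0.5) ...
p_confident = [0.9, 0.9, 0.1, 0.1]
p_hesitant = [0.6, 0.6, 0.4, 0.4]

# ... but the confident classifier achieves a much lower (better) log-loss.
print('confident: %.3f' % log_loss(y_true, p_confident))  # 0.105
print('hesitant:  %.3f' % log_loss(y_true, p_hesitant))   # 0.511
```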

In an algorithm, the following selection would be done by a pipeline, with the predicted probability as one factor: we keep only the predictions the classifier is most confident about.

In [30]:

```
perc_verg = verg[verg.prob >= np.nanpercentile(verg['prob'], 90)].append(
    verg[verg.prob <= np.nanpercentile(verg['prob'], 10)])
```

In [31]:

```
print('Accuracy on test set with selected percentiles = {:.2f}%'.format(metrics.accuracy_score(perc_verg['orig'], perc_verg['pred']) * 100))
print('Number of trades in selected test set: ' + str(len(perc_verg)))
```

Seems like we're at chance on this data set, alas. But perhaps you can do better?

We can also examine which factors the classifier identified as most predictive. Note that this requires a classifier exposing a `feature_importances_` attribute, such as the AdaBoost variant above; `GaussianNB` does not provide one.

In [74]:

```
# Note: feature_importances_ is only available for tree-based classifiers
# such as AdaBoostClassifier, not for GaussianNB.
feature_importances = pd.Series(clf.feature_importances_, index=results_wo_returns.columns)
feature_importances.sort(ascending=False)
ax = feature_importances.plot(kind='bar')
ax.set(ylabel='Importance (Gini Coefficient)', title='Feature importances');
```

*Exercise*: Use partial dependence plots to get an understanding of how factor rankings are used to predict future returns.

Several knobs can be tweaked to boost performance:

- Add existing factors from the collection above to the data set.
- Come up with new factors:
  - Use `alphalens` to evaluate an alpha for its predictive power.
  - Look for novel data sources from our partners.
  - Look at the 101 Alphas Project.
- Improve preprocessing of the ML pipeline:
  - Is 70/30 the best split?
  - Should we not binarize the returns and do regression instead?
  - Can we add Sector information in some way?
- Experiment with feature selection:
  - PCA
  - ICA
  - etc.
- Tweak hyper-parameters of `AdaBoostClassifier`.
- Try different classifiers or combinations of classifiers.

If you have something you think works well, post it in this thread. Make sure to test over the same time period as I have here to keep things comparable. A month from now, we can test on the new data that has accumulated since then and determine who built the best ML pipeline. If there is demand, we might turn this into a proper ML contest.

If you look for information on how to get started with ML, here are a few resources:

- Scikit-learn resources
- Learning scikit-learn: Machine Learning in Python
- Pattern Recognition and Machine Learning

As mentioned above, this is not immediately usable in an algorithm. For one thing, there is no `run_pipeline()` in the backtest IDE. It turns out to be rather simple to take the code above and put it into a pipeline `CustomFactor()`, where the ML model would automatically get retrained and make predictions. You would then go long the `1` predictions and short the `-1` predictions, apply some weighting (e.g. inverse variance), and execute orders. More on these next steps in the future.

- Content created by James Christopher and Thomas Wiecki
- Thanks to Sheng Wang for ideas and inspiration.
- Thanks to Jonathan Larkin, Jess Stauth, Delaney Granizo-Mackenzie, and Jean Bredeche for helpful comments on an earlier draft.
