By Delaney Mackenzie

Log returns are commonly used in quant research because real data tends to have fat tails. Fat tails are common in self-reinforcing/autocorrelated systems like financial markets, and the result is that normal distributions will heavily underestimate the likelihood of rare events. See our full lecture on this here:

https://www.quantopian.com/lectures/autocorrelation-and-ar-models

Some stock prices are closer to a normal distribution when log transformed, so in the course of doing research it can be helpful to log transform your data before fitting models. Remember that at the end of the day prices are still prices, so don't assume just because log transformed returns are well behaved that you're not vulnerable to tail events.
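As a quick illustration of what the log transform does to returns, here is a minimal sketch on a made-up price series (hypothetical numbers, not real data). It shows the identity log return = ln(1 + percent return), and the fact that log returns compound additively over time:

```python
import numpy as np

# Toy price series (hypothetical, for illustration only).
prices = np.array([100.0, 102.0, 99.0, 103.0])

simple_R = prices[1:] / prices[:-1] - 1     # percent returns
log_R = np.log(prices[1:] / prices[:-1])    # log returns = ln(1 + R)

# Log returns are additive: their sum is the total log return over the period.
total = np.log(prices[-1] / prices[0])
print(np.isclose(log_R.sum(), total))            # True
print(np.allclose(log_R, np.log(1 + simple_R)))  # True
```

This additivity is one practical reason log returns are convenient when fitting models over multiple periods.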

Also, many stock prices are not log-normally distributed.

In this example, you'll see that the asset is not truly log-normal; instead, it was just a smaller sample size that caused it to pass the test.

This notebook is just a quick piece showing how to log transform prices.

In [1]:

```
import numpy as np
import pandas as pd
# This is a plotting library for pretty pictures.
import matplotlib.pyplot as plt
```

In [2]:

```
# Research environment functions
from quantopian.research import returns, log_returns, symbols

# Select a time range to inspect
period_start = '2012-01-01'
period_end = '2012-06-01'

# Query returns data for XLE over the selected time range
R = returns(
    assets=symbols('XLE'),
    start=period_start,
    end=period_end,
)
log_R = log_returns(
    assets=symbols('XLE'),
    start=period_start,
    end=period_end,
)

# Display first 10 rows
R.head(10)
```

Out[2]:

Let's look at the data distribution.

In [3]:

```
plt.hist(R, bins=20)
plt.xlabel('Return')
plt.ylabel('Observations');
```

Let's also run a statistical normality check.

In [4]:

```
from scipy.stats import normaltest
```

In [5]:

```
significance_level = 0.05

result = normaltest(R)
if result.pvalue < significance_level:
    print('Data likely not normally distributed.')
else:
    print('Data likely normally distributed.')
```

Log transforms are `NaN` on negative data, so we have to use rational returns (1 + percent return) instead of percent returns.
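A minimal sketch of the problem on made-up percent returns: any negative entry produces `NaN` under a raw log, while shifting to rational returns first avoids it.

```python
import numpy as np

# Hypothetical daily percent returns; one of them is negative.
R = np.array([0.010, -0.020, 0.005])

with np.errstate(invalid='ignore'):
    bad = np.log(R)        # log of a negative number is NaN

good = np.log(R + 1)       # shift to rational (gross) returns first

print(np.isnan(bad).sum())   # 1
print(np.isnan(good).sum())  # 0
```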

In [6]:

```
np.log(R).tail()
```

Out[6]:

In [7]:

```
rational_R = R + 1
```

In [8]:

```
np.log(rational_R).tail()
```

Out[8]:

However, we got log returns already from the built-in method, so we'll just use those.

In [9]:

```
log_R.tail()
```

Out[9]:
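The built-in `log_returns` should agree with `np.log(1 + R)`; here's a quick sanity sketch of that identity on made-up returns (we can't call the Quantopian API outside the research environment). It also uses numpy's `log1p`, which computes log(1 + x) with better precision for very small returns:

```python
import numpy as np

# Hypothetical daily percent returns.
R = np.array([0.012, -0.008, 0.0003])

# log(1 + R) and log1p(R) are mathematically identical; log1p is the
# numerically safer spelling when R is close to zero.
print(np.allclose(np.log(1 + R), np.log1p(R)))  # True
```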

In [10]:

```
plt.hist(log_R, bins=20)
plt.xlabel('Return')
plt.ylabel('Observations');
```

In [11]:

```
significance_level = 0.05

result = normaltest(log_R)
if result.pvalue < significance_level:
    print('Data likely not normally distributed.')
else:
    print('Data likely normally distributed.')
```

Often though, returns will still not be normally distributed even with a log transform. Don't apply this blindly without checking.

Here we see that just by expanding the window and gathering more data, the test gains more power and differentiates the returns distribution we observe from a normal one. It would seem that the true process was likely not normal in the first place; we simply had too few samples to realize this.
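The same effect can be simulated without any market data. Below is a sketch using fat-tailed Student's-t draws (a stand-in assumption for real returns, not the actual asset): with few observations the normality test often fails to reject, while with many observations it almost always does.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)

# Both samples come from the same fat-tailed (t, df=4) process,
# so neither is truly normal.
small = rng.standard_t(df=4, size=60)
large = rng.standard_t(df=4, size=5000)

# Small samples frequently pass; large samples give the test enough
# power to reject.
print(normaltest(small).pvalue)
print(normaltest(large).pvalue)
```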

In [12]:

```
# Select a time range to inspect
period_start = '2012-01-01'
period_end = '2016-01-01'

# Query returns data for XLE over the selected time range
R = returns(
    assets=symbols('XLE'),
    start=period_start,
    end=period_end,
)

rational_R = R + 1
log_R = np.log(rational_R)
log_R.tail()
```

Out[12]:

In [13]:

```
plt.hist(log_R, bins=20)
plt.xlabel('Return')
plt.ylabel('Observations');
```

In [14]:

```
significance_level = 0.05

result = normaltest(log_R)
if result.pvalue < significance_level:
    print('Data likely not normally distributed.')
else:
    print('Data likely normally distributed.')
```

Let's use Pipeline to get returns for a whole bunch of stocks.

In [15]:

```
# Pipeline imports
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Returns

# Pipeline definition
def make_pipeline():
    returns = Returns(window_length=2)
    return Pipeline(
        columns={
            'daily_returns': returns,
        },
    )

# Pipeline execution
data_output = run_pipeline(
    make_pipeline(),
    start_date='2012-1-1',
    end_date='2013-1-1'
)
```

In [16]:

```
data_output = data_output.unstack()
data_output = np.log(data_output+1)
data_output.head()
```

Out[16]:

In [17]:

```
data_output.shape
```

Out[17]:

This will incur a lot of multiple comparisons bias.

https://www.quantopian.com/lectures/p-hacking-and-multiple-comparisons-bias
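A sketch of why this matters, on simulated data rather than the pipeline output: even when every series is drawn from the same fat-tailed, non-normal process, some fraction will still pass a 5% normality test purely through chance and limited power. (The 500-asset universe and 250 daily observations below are hypothetical sizes, not the real pipeline shape.)

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(1)
alpha = 0.05

# Hypothetical universe: 500 "assets", ~one year of daily observations each.
# Every series is fat-tailed (t, df=4), so none is truly normal...
data = rng.standard_t(df=4, size=(500, 250))
pvalues = normaltest(data, axis=1).pvalue

# ...yet some still "look normal" at the 5% level. With enough tests,
# a battery of individual checks will always produce some passes.
num_pass = int((pvalues >= alpha).sum())
print(num_pass, 'of', len(pvalues), 'fat-tailed series pass the test')
```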

In [18]:

```
num_assets = data_output.shape[1]
num_normal = 0

for i in range(num_assets):
    # Get the return series for the asset
    log_R = data_output.iloc[:, i]
    result = normaltest(log_R)
    if result.pvalue >= significance_level:
        num_normal += 1
```

In [19]:

```
print('The percent of stocks which are likely normally distributed: %s%%'
      % (float(num_normal) / num_assets * 100))
```

This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.