Notebook

Pipeline Misses Stocktwits Datapoints from Psychsignal

This notebook shows how pulling Stocktwits Psychsignal data through Pipeline drops certain days. I believe Pipeline is currently designed to run on the equities trading calendar, so on each trading day it returns data from the most recently available day. For this dataset, Monday's query therefore usually pulls Sunday's data (if Sunday's data is missing, it falls back to Saturday's or Friday's). Since Friday's query pulls Thursday's data, the Friday and Saturday datapoints are typically never returned.
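
To make the reasoning above concrete, here is a minimal sketch in plain Python (it does not use Quantopian). It assumes, as a simplification, that each session sees the latest asof_date strictly before it and that the trading calendar is just weekdays with no holidays. The function name latest_asof_per_session and the synthetic dates are made up for illustration.

```python
from datetime import date, timedelta

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']


def latest_asof_per_session(asof_dates, sessions):
    """For each session, pick the most recent asof_date strictly before it.

    This is an assumed, simplified model of the behavior described above,
    not Pipeline's actual implementation.
    """
    asof_sorted = sorted(asof_dates)
    picked = {}
    for session in sessions:
        earlier = [d for d in asof_sorted if d < session]
        picked[session] = earlier[-1] if earlier else None
    return picked


# Four weeks of synthetic daily data, weekends included.
start = date(2018, 4, 2)  # a Monday
all_days = [start + timedelta(days=i) for i in range(28)]

# Stand-in for the trading calendar: weekdays only.
sessions = [d for d in all_days if d.weekday() < 5]

picked = latest_asof_per_session(all_days, sessions)

# Count how often each day of the week is the one a session ends up seeing.
counts = {name: 0 for name in DAY_NAMES}
for asof in picked.values():
    if asof is not None:
        counts[DAY_NAMES[asof.weekday()]] += 1

# counts['Friday'] and counts['Saturday'] are both 0:
# no weekday session ever picks those days in this model.
```

In this toy model every Friday and Saturday datapoint is skipped even though the data contains them, which is the pattern the rest of this notebook looks for in the real data.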

In [1]:
from quantopian.interactive.data.psychsignal import stocktwits

# import data operations
from odo import odo
import pandas as pd
from blaze import by

Access the Data Interactively using Blaze

In [2]:
aapl_sentiment = odo(stocktwits[stocktwits['symbol']=='AAPL'], pd.DataFrame)
aapl_sentiment.head()
Out[2]:
source symbol bullish_intensity bearish_intensity bull_minus_bear bull_scored_messages bear_scored_messages bull_bear_msg_ratio total_scanned_messages sid asof_date timestamp
0 stocktwits AAPL 2.3 0.00 2.30 1.0 0.0 0.0 1.0 24 2009-07-10 04:00:00 2009-07-11 04:00:00
1 stocktwits AAPL 0.0 0.00 0.00 0.0 0.0 0.0 3.0 24 2009-07-11 04:00:00 2009-07-12 04:00:00
2 stocktwits AAPL 0.0 2.80 -2.80 0.0 1.0 0.0 4.0 24 2009-07-13 04:00:00 2009-07-14 04:00:00
3 stocktwits AAPL 0.0 0.00 0.00 0.0 0.0 0.0 1.0 24 2009-07-28 04:00:00 2009-07-29 04:00:00
4 stocktwits AAPL 0.0 2.15 -2.15 0.0 1.0 0.0 2.0 24 2009-07-30 04:00:00 2009-07-31 04:00:00

This shows that the weekend data does exist in the dataset.

In [3]:
DAYS_OF_WEEK = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_of_week_dict = dict(zip(range(7), DAYS_OF_WEEK))

aapl_sentiment['asof_day_of_week'] = aapl_sentiment['asof_date'].map(lambda x: day_of_week_dict[x.dayofweek])
aapl_sentiment.loc[:,'asof_date': 'asof_day_of_week'].tail(7)
Out[3]:
asof_date timestamp asof_day_of_week
3155 2018-04-24 2018-04-25 10:06:32.031732 Tuesday
3156 2018-04-25 2018-04-26 10:05:58.838941 Wednesday
3157 2018-04-26 2018-04-27 10:06:02.745156 Thursday
3158 2018-04-27 2018-04-28 10:05:59.176612 Friday
3159 2018-04-28 2018-04-29 10:06:03.421753 Saturday
3160 2018-04-29 2018-04-30 10:06:27.914562 Sunday
3161 2018-04-30 2018-05-01 10:06:02.959248 Monday
In [4]:
start_date = aapl_sentiment.asof_date.iloc[0]
end_date = aapl_sentiment.asof_date.iloc[-1]
 

print "Count of Observations by Day of Week using Blaze"
print "{} to {}".format(start_date.date(), end_date.date())
aapl_sentiment.asof_day_of_week.value_counts()[DAYS_OF_WEEK]
Count of Observations by Day of Week using Blaze
2009-07-10 to 2018-04-30
Out[4]:
Monday       455
Tuesday      456
Wednesday    451
Thursday     454
Friday       454
Saturday     443
Sunday       449
Name: asof_day_of_week, dtype: int64

Access via Pipeline

In [5]:
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data import Fundamentals  
from quantopian.pipeline.data.psychsignal import stocktwits
In [6]:
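def make_pipeline():
    # Restrict the pipeline output to AAPL only.
    universe = Fundamentals.symbol.latest.eq('AAPL')

    # .latest gives the most recent value known as of each trading day;
    # asof_date is included so we can see which day the data came from.
    return Pipeline(columns={'bullish_intensity': stocktwits.bullish_intensity.latest,
                             'bearish_intensity': stocktwits.bearish_intensity.latest,
                             'total_scanned_messages': stocktwits.total_scanned_messages.latest,
                             'asof_date': stocktwits.asof_date.latest},
                   screen=universe)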
    
In [7]:
result = run_pipeline(make_pipeline(), start_date, end_date)  
In [8]:
result['asof_day_of_week'] = result['asof_date'].map(lambda x: day_of_week_dict[x.dayofweek])

print "Count of Observations by Day of Week using Pipeline"
print "{} to {}".format(start_date.date(), end_date.date())
result['asof_day_of_week'].value_counts()[DAYS_OF_WEEK]
Count of Observations by Day of Week using Pipeline
2009-07-10 to 2018-04-30
Out[8]:
Monday       464
Tuesday      461
Wednesday    439
Thursday     439
Friday         7
Saturday       3
Sunday       403
Name: asof_day_of_week, dtype: int64

From the table above, it looks like Pipeline typically misses the Friday and Saturday datapoints (as expected).

Let's look at the dates where we actually do get Friday/Saturday data.

In [9]:
result[result['asof_day_of_week'].isin(['Friday', 'Saturday'])]
Out[9]:
asof_date bearish_intensity bullish_intensity total_scanned_messages asof_day_of_week
2009-07-13 00:00:00+00:00 Equity(24 [AAPL]) 2009-07-11 04:00:00 0.00 0.000000 3.0 Saturday
2009-08-03 00:00:00+00:00 Equity(24 [AAPL]) 2009-07-31 04:00:00 0.00 0.000000 1.0 Friday
2009-08-04 00:00:00+00:00 Equity(24 [AAPL]) 2009-07-31 04:00:00 0.00 0.000000 1.0 Friday
2009-08-17 00:00:00+00:00 Equity(24 [AAPL]) 2009-08-14 04:00:00 0.00 2.880000 1.0 Friday
2009-08-18 00:00:00+00:00 Equity(24 [AAPL]) 2009-08-14 04:00:00 0.00 2.880000 1.0 Friday
2009-08-24 00:00:00+00:00 Equity(24 [AAPL]) 2009-08-21 04:00:00 0.00 0.000000 1.0 Friday
2009-09-21 00:00:00+00:00 Equity(24 [AAPL]) 2009-09-18 04:00:00 2.89 2.040000 16.0 Friday
2009-09-28 00:00:00+00:00 Equity(24 [AAPL]) 2009-09-26 04:00:00 0.00 0.500000 2.0 Saturday
2010-08-16 00:00:00+00:00 Equity(24 [AAPL]) 2010-08-13 04:00:00 1.26 1.800000 21.0 Friday
2017-03-20 00:00:00+00:00 Equity(24 [AAPL]) 2017-03-18 00:00:00 1.67 1.916667 115.0 Saturday

Focus in on March 20, 2017 and surrounding days

In [10]:
# Pipeline
result.loc['2017-03-16':'2017-03-23']
Out[10]:
asof_date bearish_intensity bullish_intensity total_scanned_messages asof_day_of_week
2017-03-16 00:00:00+00:00 Equity(24 [AAPL]) 2017-03-15 1.718451 1.878111 759.0 Wednesday
2017-03-17 00:00:00+00:00 Equity(24 [AAPL]) 2017-03-16 1.650455 1.770859 472.0 Thursday
2017-03-20 00:00:00+00:00 Equity(24 [AAPL]) 2017-03-18 1.670000 1.916667 115.0 Saturday
2017-03-21 00:00:00+00:00 Equity(24 [AAPL]) 2017-03-20 1.701296 1.815703 531.0 Monday
2017-03-22 00:00:00+00:00 Equity(24 [AAPL]) 2017-03-21 1.865872 1.792753 1149.0 Tuesday
2017-03-23 00:00:00+00:00 Equity(24 [AAPL]) 2017-03-22 1.815729 1.784314 677.0 Wednesday
In [11]:
aapl_sentiment.set_index('asof_date', inplace=True)
In [12]:
aapl_sentiment.loc['2017-03-16':'2017-03-23', ['timestamp', 'asof_day_of_week','total_scanned_messages']]
Out[12]:
timestamp asof_day_of_week total_scanned_messages
asof_date
2017-03-16 2017-03-17 10:06:11.754931 Thursday 472.0
2017-03-17 2017-03-18 10:06:47.462614 Friday 589.0
2017-03-18 2017-03-19 10:06:34.968719 Saturday 115.0
2017-03-20 2017-03-21 10:05:58.538213 Monday 531.0
2017-03-21 2017-03-22 10:06:27.910286 Tuesday 1149.0
2017-03-22 2017-03-23 10:05:56.268730 Wednesday 677.0
2017-03-23 2017-03-24 10:06:57.305158 Thursday 630.0

The interactive data has no datapoint for Sunday, March 19th. That is why Pipeline returns Saturday's data on Monday, March 20th: Pipeline pulls the most recently available data, which is typically Sunday's, but here it was Saturday's.
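
The fallback can be reproduced with a small sketch in plain Python, using only the asof_dates from the interactive table above. It assumes the same simplified rule (each session sees the latest asof_date strictly before it); the helper latest_before is a made-up name, not part of any Quantopian API.

```python
from datetime import date

# asof_dates present in the interactive data for this window.
# 2017-03-19 (Sunday) is absent.
available = [date(2017, 3, d) for d in (16, 17, 18, 20, 21, 22, 23)]

# Trading sessions in the same window (weekdays only).
sessions = [date(2017, 3, d) for d in (17, 20, 21, 22, 23)]


def latest_before(asof_dates, session):
    """Most recent asof_date strictly before the session (assumed rule)."""
    earlier = [d for d in asof_dates if d < session]
    return max(earlier) if earlier else None


picked = {s: latest_before(available, s) for s in sessions}

# Monday's session falls back to Saturday because Sunday is missing.
monday_pick = picked[date(2017, 3, 20)]  # date(2017, 3, 18)

# Friday's datapoint is still never selected by any session.
friday_seen = date(2017, 3, 17) in picked.values()  # False
```

The picks for these five sessions match the asof_date column in the Pipeline output above.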

Summary

While this behavior is not unexpected, it would be nice to be able to pull the Friday/Saturday data using Pipeline. I do not know enough about how Pipeline works on the backend to suggest the best fix; the purpose of this notebook is just to document the issue.
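
For anyone who needs the skipped days in the meantime, one possible workaround is to aggregate outside Pipeline. The sketch below is an assumption of mine, not a Quantopian feature: it assigns each asof_date to the first weekday after it (ignoring holidays) and sums total_scanned_messages per session. The helper next_weekday_session is hypothetical; the message counts are taken from the interactive table above.

```python
from collections import defaultdict
from datetime import date, timedelta


def next_weekday_session(asof):
    """First weekday strictly after asof (holidays ignored in this sketch)."""
    session = asof + timedelta(days=1)
    while session.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        session += timedelta(days=1)
    return session


# (asof_date, total_scanned_messages) rows from the interactive data.
rows = [
    (date(2017, 3, 16), 472.0),  # Thursday
    (date(2017, 3, 17), 589.0),  # Friday
    (date(2017, 3, 18), 115.0),  # Saturday
    (date(2017, 3, 20), 531.0),  # Monday
]

# Sum every row into the session that would first see it.
messages_by_session = defaultdict(float)
for asof, n_messages in rows:
    messages_by_session[next_weekday_session(asof)] += n_messages

# Monday 2017-03-20 now carries Friday's and Saturday's messages:
# 589.0 + 115.0 = 704.0
```

Summing works for message counts. The intensity columns would need a different aggregation (for example, a message-weighted average), which this sketch does not attempt.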