Time Series Terminal Whitepaper

Brief Solutions Ltd, May 2022

The Problem

Is time series A predictive of time series B?

i.e. given all information about A's values up to time t, is it informative about the conditional distribution of B's value at a future time t+p, where p is the prediction horizon?

Is this relationship causal?

i.e. the presence of A makes the prediction of the future of B better, and its absence makes it worse. A computational version of this notion is Granger causality. For a philosophical overview, refer to Causation; for a particular discussion, Kant and Hume on Causality.
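As a concrete illustration of the computational notion only (a minimal sketch, not the procedure used by our engine), a textbook Granger-causality check between two series can be run with the statsmodels package; the series a and b below are synthetic and the lag order is arbitrary.

    # Minimal sketch of a textbook Granger-causality check, using the
    # statsmodels package; illustrative only, not the engine's actual
    # procedure. Series a and b are synthetic, and b depends on a with
    # a lag of two steps.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(0)
    a = pd.Series(rng.normal(size=500))                       # candidate predictor A
    b = 0.8 * a.shift(2).fillna(0.0) + rng.normal(size=500)   # target B

    # Column order matters: the test asks whether the second column (A)
    # helps predict the first column (B) beyond B's own history.
    data = pd.concat([b, a], axis=1)
    results = grangercausalitytests(data.values, maxlag=4)

    # p-value of the F-test at lag 2; a small value is evidence that A
    # Granger-causes B
    print(results[2][0]["ssr_ftest"][1])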

In our view, carrying out this computation to detect "causality" in the past may serve to prune away certain "improbable" future scenarios about the outside world, thus simplifying the way we represent the outside world in our own minds.

The Solution

Given a pre-configured collection of time series data, we designed a computational engine to compute the predictive power of any series A for any series B, for a given prediction horizon, by conducting many random experiments. When we collate the results, we obtain a causal graph that represents the strengths of predictive power between the series.

For example, following the close of Thu 14 Dec 2023, the computational engine updates all causal relations for the 24-week-ahead horizon. The vicinity around the S&P 500 ETF SPY.US would read as follows:

link: https://tsterm.com/?q=SPY.US&h=24w&asof=2023-12-14
so it is computed that the history of values (in this case, the stock price) of Microsoft MSFT.US up to Thu 14 Dec 2023 is predictive of the value of the S&P 500 ETF SPY.US 24 weeks ahead, on Thu 30 May 2024.

Once the causal predictors are selected, we train a group of models, each predicting a complete conditional distribution of the target time series' value at the forecast horizon; these predictions can be aggregated in different ways for practical use.

For example, for a certain target instrument and a certain target date, two assessors, Model 1 and Model 2, predicted the value change (from now to the target date) at the 25%, 50% and 75% probability levels as below:

                         25%   50%   75%  Expectation
Model 1                  -14     3    17    2 (greater than 0)
Model 2                  -20    -4     6   -6 (less than 0)
Averaged Distribution    -17  -0.5  11.5

Model 1 seems more "optimistic" than Model 2, predicting a higher value change at each probability level. For each probability level, if we average the two predicted numbers across the models, we obtain the results in the last row, "Averaged Distribution". On tsterm.com the averaged conditional distribution is plotted on charts.
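The arithmetic of the table can be reproduced in a few lines; the model names and quantile levels simply mirror the example above.

    # Reproduce the worked example: average the per-model quantile
    # predictions level by level, and compute each model's expectation
    # as the plain average of its three predicted quantiles.
    import numpy as np

    levels = [0.25, 0.50, 0.75]
    predictions = {
        "Model 1": np.array([-14.0, 3.0, 17.0]),
        "Model 2": np.array([-20.0, -4.0, 6.0]),
    }

    averaged = np.mean(list(predictions.values()), axis=0)
    print(averaged)   # averaged distribution at the 25%, 50%, 75% levels: -17, -0.5, 11.5

    for name, quantiles in predictions.items():
        print(name, "expectation:", quantiles.mean())   # Model 1: 2.0, Model 2: -6.0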

On the other hand, if Model 1 is to take an action based solely on its own predicted conditional distribution, one way is to first compute the expectation as the average of the three predicted numbers. Depending on whether the expectation is greater or less than zero, it can vote "up" or "down", take a long or short trading position, etc. On tsterm.com the "up" and "down" vote tally is printed in tables.

In the above example, out of two models, one would vote "up" and one "down", i.e. 1/2 = 50% of models vote "up" and 1/2 = 50% vote "down". The net prediction is then 50% - 50% = 0%: on balance the models vote neither "up" nor "down". A synthetic trading position based directly on the net prediction would then be 0% of the person's maximal investing size, a quantity in number of instruments, e.g. 100 AAPL.US shares, 2,000 GBP for the foreign exchange rate GBPUSD as target instrument, or 500 USD for USDJPY.
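Continuing the same two-model example, the vote tally, net prediction and synthetic position can be written out as below; the maximal investing size of 100 shares is a hypothetical choice, not a value from the text.

    # Sketch of the vote tally and the resulting synthetic position for
    # the two-model example; the maximal investing size of 100 shares is
    # hypothetical.
    expectations = {"Model 1": 2.0, "Model 2": -6.0}

    n = len(expectations)
    up_share = sum(e > 0 for e in expectations.values()) / n     # 0.5
    down_share = sum(e < 0 for e in expectations.values()) / n   # 0.5

    net_prediction = up_share - down_share                       # 0.0
    max_size = 100                                               # shares, hypothetical
    position = net_prediction * max_size                          # 0 shares
    print(net_prediction, position)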

Goodness of the Probabilistic Prediction

Regardless of the nature of the time series data set, we can generically test the following hypothesis:

The realised value follows the predicted conditional distribution well.

How to go about it? For each timestamp, we pit the realised value against the predicted conditional distribution, averaged from the individual distributions obtained by each model some time earlier. Intuitively, if the realised value always sits at the lower end of the averaged predicted conditional distribution, the higher end of that distribution gets realised less often than predicted, and vice versa.

We denote by realised probability level the amount of probability mass in the predicted conditional distribution that lies below the realised value. It is always a value between 0 and 1. One may prove that if the prediction is well done, the realised probability level is distributed uniformly between 0 and 1; it cannot hover most often around any particular region between 0 and 1. Testing the hypothesis above is therefore equivalent to testing:

The realised probability level follows the uniform distribution between 0 and 1,

for which we take a random sample of dates, look up the respective realised values and predicted conditional distributions, and compute the realised probability levels. A statistical test then produces a p-value for how well the realised probability levels are distributed uniformly between 0 and 1.
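As a sketch, assuming the averaged predicted conditional distribution is represented by a set of scenario values, the realised probability level for one date is simply the fraction of predicted values lying below the realisation; the scenario sample below is synthetic, not actual engine output.

    # Sketch of the realised probability level for one target date,
    # assuming the averaged predicted conditional distribution is
    # represented by a set of scenario values; the scenarios here are
    # synthetic.
    import numpy as np

    rng = np.random.default_rng(1)
    predicted_scenarios = rng.normal(loc=430.0, scale=25.0, size=1000)  # hypothetical
    realised_value = 421.19

    realised_level = np.mean(predicted_scenarios < realised_value)
    print(realised_level)   # a value between 0 and 1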

For example, for predictions of the S&P 500 ETF SPY.US at the 24-week-ahead horizon, let us first look at two target dates, Fri 2023-10-20 and Fri 2023-11-24. Around the former date the price was going down; around the latter it was going up.

For Fri 2023-10-20, the prediction was made 24 weeks before, on Fri 2023-05-05. The realisation 421.19 ends in the lower 50% of predicted outcome scenarios. Precisely, 42% of the predicted values were below the realisation 421.19, so the realised probability level was 42%, or 0.42.

For Fri 2023-11-24, the prediction was made 24 weeks before, on Fri 2023-06-09. The realisation 455.3 ends in the upper 50% of predicted outcome scenarios. Precisely, 56% of the predicted values were below the realisation 455.3, so the realised probability level was 56%, or 0.56.

We collect a random sample of 20 realised probability levels, including the above two values 0.42 and 0.56, and plot their empirical distribution alongside the probability density function of the uniform distribution, which is a constant density of 1 for any value between 0 and 1. Could the sampled realised probability levels have come from the uniform distribution between 0 and 1?

Let us run a statistical test that produces a p-value as an indicator of the goodness of fit of the sample, under the hypothesis that it did come from the uniform distribution. The p-value of any statistical test is always a value between 0 and 1. For our sample above, we obtain a p-value of 0.0013.
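The text does not name the specific test; as one common choice, a one-sample Kolmogorov-Smirnov test against the uniform distribution yields such a p-value. The sample below is illustrative only and does not reproduce the quoted 0.0013.

    # Sketch: test a sample of realised probability levels against the
    # uniform distribution on [0, 1] with a one-sample Kolmogorov-Smirnov
    # test (one common choice; not necessarily the test used on tsterm.com).
    # The sample is illustrative and does not reproduce the quoted 0.0013.
    import numpy as np
    from scipy import stats

    realised_levels = np.array([0.42, 0.56, 0.12, 0.08, 0.15, 0.22, 0.05,
                                0.31, 0.18, 0.27, 0.09, 0.11, 0.35, 0.48,
                                0.21, 0.14, 0.03, 0.39, 0.25, 0.07])  # 20 values

    statistic, p_value = stats.kstest(realised_levels, "uniform")
    print(p_value)   # reject the uniformity hypothesis at threshold 0.05 if below 0.05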

How high should the p-value be for us not to reject the hypothesis that the realised probability level follows the uniform distribution between 0 and 1, or equivalently, that the predicted conditional distribution is good? In practice people have different preferences for the threshold, e.g. 0.05, 0.01, 0.001, etc.

For example, if we decide to use the threshold 0.05: when the computed p-value is at least 0.05, we do not reject the hypothesis; when it is below 0.05, we reject it. For our sample, the computed p-value 0.0013 is below 0.05, so we reject the hypothesis that the predicted conditional distribution is good. But if we decide to use the threshold 0.001 instead, we do not reject the hypothesis, since 0.0013 is above 0.001.

Why does the threshold appear rather small; don't we want a high one for high "confidence"? Because the threshold is also the rate of the error of rejecting the hypothesis when it is in fact true. When the threshold is 0.05, i.e. when the user rejects the hypothesis whenever the computed p-value is below 0.05, there is a 0.05 or 5% chance of making the mistake of rejecting the hypothesis when it is true.

A note on why we take a random sample of realised probability levels rather than every record. Under the 24-week-ahead horizon on daily data, the prediction horizon is, relatively speaking, far away, so the predicted conditional distribution is usually empirically wide in order to account for extreme outcomes. It is then inevitable that the distributions predicted on adjacent days differ little; the corresponding realised probability levels do not differ much either, which introduces strong correlation between consecutive records.

Backtest

On financial data we can, in addition, perform a synthetic trading backtest to show the goodness of the predictions. Below is how we do it. Each model, from its predicted conditional distribution, makes a vote of "up", "down" or "undecided" for the direction of movement from the calculation time to the forecast horizon. We then tally the votes. For example, with the Invesco Nasdaq-100 ETF as the forecast target and a 2-day forecast horizon,

The net predictions, in terms of the percentage of models predicting up minus that of models predicting down, are

We translate this into a synthetic trading position in a straightforward way:

When the synthetic position size is a negative number, e.g. -0.5 shares, it means 0.5 shares are sold short. As the net prediction can only vary between "100% of models predict down" and "100% of models predict up", the synthetic position can take any size between -1 share and 1 share, inclusive.

In the above example, as the prediction is for the conditional distribution of the value two days away from the calculation time, a synthetic trade is opened at the calculation time, just after the computation is done, and has to be kept open for 2 days, in order to test the goodness of, or "monetise", the net prediction.

In table format, with rows labelled by the close at which the trade is opened and columns by the day over which it is held:

             Wed    Thu    Fri
Tue close   -0.5   -0.5
Wed close          -0.2   -0.2

The single day, Thu the 26th, is affected by two trades: one opened as of Tue close, the other as of Wed close. What is the effective synthetic position for that day? We simply average the two predictions made at Tue close and Wed close:

1/2 × (-0.5 + (-0.2)) = -0.35 shares.

In this way, the effective synthetic position for any day is always an average of the most recent predictions computed prior to that day. The longer the forecast horizon, the more predictions the average is taken over.
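A minimal sketch of this averaging, reproducing the small example above for the 2-day horizon; the day labels are hypothetical.

    # Sketch of the effective synthetic position, reproducing the small
    # example above for a 2-day horizon; the day labels are hypothetical.
    import pandas as pd

    h = 2  # forecast horizon in trading days
    net_predictions = pd.Series({"Tue close": -0.5, "Wed close": -0.2})

    # The position held through Thu is the average of the h most recent
    # net predictions made before Thu, i.e. those of Tue and Wed close.
    position_thu = net_predictions.tail(h).mean()
    print(position_thu)   # -0.35 shares

    # More generally, with a daily series of net predictions "preds"
    # indexed by calculation date, the position held over each day would
    # be preds.rolling(window=h).mean().shift(1).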

With the 2-day-ahead forecast horizon, under the ideal condition of no trading cost, the performances would be:

For Invesco Nasdaq-100 ETF,

For SPDR S&P 500 ETF,

For SPDR Dow Jones Industrial Average ETF,

For interest rates, let's switch to a 1-week ahead forecast horizon:

For US 5-Year Treasury Yield Rate,

For US 10-Year Treasury Yield Rate,