Multivariate Time Series Analysis Template

If you’re looking for a standard boilerplate to analyse multivariate time-series data, this is for you!

Edward Elson Kosasih
Dec 29, 2021

In this post, I present standard steps that can be used for multivariate time series analysis, including an application to forecasting. There are six steps that we will perform:

1. Causality Test
2. Cointegration Test
3. Stationarity Test and Make Stationary
4. Find optimal order p for VAR(p)
5. Residual Serial Correlation Test
6. Forecast and Rollback Transformation

I’ve also created a Jupyter Notebook to accompany this post here. We use the following time series data for our experiment.

Our Time Series Data

Causality Test

Our null hypothesis is that, given a pair of time series, they are not related, i.e. the coefficients on the lagged values of one series in the autoregressive model of the other are all 0. If the Granger causality test produces a test statistic with a sufficiently small p-value, we can reject the null hypothesis and speculate that there might be a relationship between the two time series.

from statsmodels.tsa.stattools import grangercausalitytests

# given a (y, x) pair of variables: does x Granger-cause y?
test_result = grangercausalitytests(df[[y, x]], maxlag=maxlag, verbose=False)
# p-values (here from the SSR chi-squared test) for every tested lag
pvalues = [res[0]['ssr_chi2test'][1] for res in test_result.values()]
# check if there's any p-value < 0.05
min_p = min(pvalues)
Pairwise Granger causality matrix. Since most of the p-values are < 0.05, there might be a relationship between almost every pair of variables.
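The snippet above tests a single pair; to build the full matrix shown above, we can loop the test over every ordered pair of columns. A minimal sketch (my own, with df and maxlag as above):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

cols = df.columns
# p-value matrix: entry (y, x) answers "does x Granger-cause y?"
pvalue_matrix = pd.DataFrame(np.ones((len(cols), len(cols))), index=cols, columns=cols)
for y in cols:
    for x in cols:
        if x == y:
            continue
        res = grangercausalitytests(df[[y, x]], maxlag=maxlag, verbose=False)
        # keep the smallest p-value across all tested lags
        pvalue_matrix.loc[y, x] = min(res[lag][0]['ssr_chi2test'][1] for lag in res)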

Cointegration Test

Our null hypothesis is that, given multiple time series, none of them are cointegrated. The Johansen test looks at the rank of the matrix that relates the variables in the autoregressive equation. For every candidate rank r from 0 up to the number of variables, it checks whether the trace statistic exceeds the critical value at the chosen significance level. If it does, we can reject the null hypothesis that the cointegration rank is at most r and speculate that there might be a relationship among more than r variables.

from statsmodels.tsa.vector_ar.vecm import coint_johansen

# given a multivariate signal in a dataframe
test_result = coint_johansen(df, det_order=-1, k_ar_diff=5)
# trace statistics and their critical values at 90%, 95% and 99%
test_stats, crit_values = test_result.lr1, test_result.cvt
ranks = range(df.columns.shape[0])
for rank, test_stat, crit_value in zip(ranks, test_stats, crit_values):
    print("Rank <= {} rejected: {}. Test stat: {}. Critical values: [90%] {}, [95%] {}, [99%] {}".format(
        rank, test_stat >= crit_value[1], round(test_stat, 4), *crit_value))
We see that up to rank 6, the Johansen trace statistic is above the 95% critical value. There might be a cointegrating relationship among 6 variables in the dataset.

Stationarity Test and Make Stationary

In order to use a vector autoregressive (VAR) model, we first need to make sure that the data is stationary. We use the ADF (Augmented Dickey-Fuller) test to check whether each signal is stationary, applying it to every variable in the dataset. The null hypothesis is that the signal is non-stationary. If we reject the hypothesis, the signal might be stationary.

from statsmodels.tsa.stattools import adfuller

test_result = adfuller(signal, autolag="AIC")
test_result[1] < 0.05  # if p-value < 0.05, reject H0 of non-stationarity
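To run the check over the whole dataset, a small sketch (assuming the series live in a DataFrame called signals, as in the differencing step below):

for col in signals.columns:
    pvalue = adfuller(signals[col], autolag="AIC")[1]
    status = "stationary" if pvalue < 0.05 else "non-stationary"
    print(f"{col}: p-value = {pvalue:.4f} ({status})")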

If any of the time series is not stationary, we transform the data. In this case we difference repeatedly until every time series is stationary according to the ADF test.

# repeat until we get a stationary signal; here differencing twice was enough
signals_diff = signals.diff().dropna().diff().dropna()
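If we'd rather automate the repetition, here is a sketch (my own addition) that keeps differencing until every column passes the ADF test, with a cap to avoid over-differencing:

from statsmodels.tsa.stattools import adfuller

signals_diff = signals.copy()
for _ in range(3):  # cap the number of differencing passes
    if all(adfuller(signals_diff[col], autolag="AIC")[1] < 0.05
           for col in signals_diff.columns):
        break  # every column is stationary
    signals_diff = signals_diff.diff().dropna()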

We can get the original signal back by cumulatively summing the differenced series and adding back the values that differencing removed.

After differencing twice, we get a stationary multivariate time series.

# example rollback for signals differenced twice; this continues the
# series from the last training observations, as used for forecasts below
d2 = signals_diff.cumsum()
d1 = (signals.diff().iloc[-1] + d2).cumsum()
d0 = signals.iloc[-1] + d1
# d0 is the original signal, rolled back from the differenced one
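As a sanity check, we can difference a known series twice, roll it back with the recipe above, and confirm that we recover the truth (a toy sketch, not from the original notebook):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(50).cumsum())   # hypothetical signal
train, future = s.iloc[:40], s.iloc[40:]
d2 = s.diff().diff().iloc[40:].cumsum()       # pretend these 2nd diffs were forecast
d1 = (train.diff().iloc[-1] + d2).cumsum()
d0 = train.iloc[-1] + d1
assert np.allclose(d0.values, future.values)  # rollback recovers the signal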

Find optimal order p for VAR(p)

We perform a hyperparameter search to find the optimal lag order p for our VAR model.

from statsmodels.tsa.api import VAR

model = VAR(signals_diff)  # fit on the stationary (differenced) data
model.select_order(maxlags=12).summary()
Output summary table. The best value per criterion is marked with an asterisk (*).
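If we want to read the chosen orders programmatically rather than off the table, the result of select_order exposes them as a dict (assuming a reasonably recent statsmodels):

order_results = model.select_order(maxlags=12)
# dict mapping each criterion ('aic', 'bic', 'fpe', 'hqic') to its chosen lag
print(order_results.selected_orders)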

We fit the model with the chosen order. In this case we pick 4 because there's an inflexion point there: AIC and FPE decrease from lag 3 to 4, increase from 4 to 5, then decrease again afterwards.

model_fitted = model.fit(4)
model_fitted.summary()
We can see the fit results as well as the equation coefficients for each signal

Residual Serial Correlation Test

We check whether the residuals of the model still contain leftover patterns. If they do, we need to change the model, e.g. increase the order of the VAR. We use the Durbin-Watson statistic to check for serial correlation in the residuals. A test statistic close to 2 means there is no correlation; values close to 0 indicate positive serial correlation, and values close to 4 indicate negative serial correlation.

from statsmodels.stats.stattools import durbin_watson

# one Durbin-Watson statistic per residual series
test_result = durbin_watson(model_fitted.resid)
for col, dw in zip(signals_diff.columns, test_result):
    print(f"{col}: {dw:.2f}")
Every time series has a DW statistic close to 2, indicating no serial correlation in the residuals

Forecast and Rollback Transformation

We use the trained VAR model to forecast future values of the data.

import pandas as pd

lag_order = model_fitted.k_ar
# the model needs the last lag_order observations as forecast input
forecast_input = signals_diff.values[-lag_order:]
# forecast n steps ahead, one step per row of the held-out test set
n = len(df_test)
forecast_output = model_fitted.forecast(y=forecast_input, steps=n)
diff_forecast = pd.DataFrame(forecast_output, index=df_test.index, columns=df_test.columns)

We need to transform the forecast output back to the original scale by repeated cumulative summation, adding back the last observed values of the original (training) signal.

# undo the second difference
d2 = diff_forecast.cumsum()
# undo the first difference, starting from the last observed first difference
d1 = (df_train.diff().iloc[-1] + d2).cumsum()
# add back the last observed level of the training data
d0 = df_train.iloc[-1] + d1
df_forecast = d0

We evaluate the forecast result by comparing it against the held-out test set with a few metrics, for example MAE and RMSE.
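A minimal evaluation sketch (my own addition, assuming df_forecast and df_test from the steps above):

import numpy as np

for col in df_test.columns:
    err = df_forecast[col] - df_test[col]
    mae = err.abs().mean()            # mean absolute error
    rmse = np.sqrt((err ** 2).mean()) # root mean squared error
    print(f"{col}: MAE = {mae:.4f}, RMSE = {rmse:.4f}")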

