After training a SARIMAX model (Seasonal ARIMA with eXogenous variables), the objective is to forecast a particular (endogenous) variable using forecasts, already provided to us by external sources, for the N exogenous variables that the model was trained on. I do nothing from scratch, though! Instead, I use off-the-shelf Python libraries, as below.
Code:
from statsmodels.tsa.stattools import adfuller
from sklearn.feature_selection import RFE
from xgboost import XGBRegressor
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tsa.stattools import acf, pacf
Here's my approach roughly:
1. Examine the time series of the variable to be forecast (the endogenous variable) for stationarity. If it is not stationary, apply first-order differencing and/or a natural-log transform to achieve stationarity. Use the Augmented Dickey-Fuller (ADF) test to check. The null hypothesis is that the series has a unit root (is non-stationary), so we want a test statistic more negative than the critical value (about -3.43 at the 1% level for the constant-only case), i.e. a p-value below 0.01.
2. Examine the N exogenous variables for multicollinearity using Variance Inflation Factor (VIF) analysis, and thereafter use a combination of Recursive Feature Elimination (RFE) and XGBoost regression to identify the variables that best explain movements in the endogenous variable. Ultimately, shrink the set of exogenous variables to reduce multicollinearity (the remaining variables won't necessarily be orthogonal, though; some relationship will remain).
3. Either examine the ACF and PACF plots to determine the AR and MA nature of the time series (see "Interpreting ACF and PACF plots" on SPUR ECONOMICS) and/or do a grid search over several combinations of p, d, and q values to obtain the lowest mean absolute percentage error (MAPE) on test data (the full dataset is divided into training, validation and test segments). If multiple models have comparable MAPEs, also examine the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to aid model selection.
4. Examine the density, Q-Q plot and correlogram for the residuals of the final model.
5. Assess model performance each period into the future and re-calibrate as necessary.
I have tried training models on both non-standardized and standardized data. Also, in some cases the endogenous variable can't take on negative values, so I apply a logarithm before modelling and use the exponential function after the forecast to reverse it. I am sure there are numerous better ways to address this.
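One detail on the log/exp round trip, sketched below under the assumption that the forecast errors on the log scale are roughly normal: plain `exp` of a log-scale forecast gives the median on the original scale, while the mean needs the extra `exp(variance / 2)` factor. The helper name `back_transform` and the numbers are hypothetical, purely for illustration.

```python
import numpy as np


def back_transform(log_mean, log_var=None):
    """Map log-scale forecasts back to the original scale.

    Without log_var this returns exp(log_mean), the median under normal errors.
    With log_var it returns exp(log_mean + log_var / 2), the lognormal mean.
    """
    log_mean = np.asarray(log_mean, dtype=float)
    if log_var is None:
        return np.exp(log_mean)
    return np.exp(log_mean + 0.5 * np.asarray(log_var, dtype=float))


# Hypothetical log-scale point forecasts and forecast variances.
log_forecast = np.log([100.0, 110.0, 121.0])
log_variance = [0.04, 0.04, 0.04]

median_forecast = back_transform(log_forecast)
mean_forecast = back_transform(log_forecast, log_var=log_variance)
```

Which of the two to report depends on the loss being optimized; for an absolute-error style metric such as MAPE the median version is often the more natural choice, but that is a judgment call rather than a rule.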
Taking a step back, I am looking to get a solid grounding in time series analysis and further my understanding. I have dabbled in some online sources in the past, but I am looking for one solid, comprehensive program. I came across "The Econometrics of Time Series Data" (Queen Mary University of London) on Coursera and am wondering if anyone has done it and/or has suggestions for other courses. I'd like the course to be practical, with concepts applied to plenty of real-world data.
Thank you for any suggestions.