When data is indexed in time order, so that the data points record how a quantity changes over time, it can be treated as time-series data. Examples are the unit sales of a commodity for a particular date, week, month, or year, or the change in temperature over time. Forecasting is one of the important domains of data science, where we predict future values from the history contained in the time series. Many models help us make such predictions, for example AR, MA, ARIMA, SARIMA, VAR, and SARIMAX.
We assume that you have a basic understanding of time series analysis and of the forecasting models. You can go through the articles below for more details on these topics.
- General Overview Of Time Series Data Analysis.
- Guide To Detrending Using Scipy Signal.
- Comprehensive Guide To Deseasonalizing Time Series.
- A Comprehensive Guide To Regression Techniques For Time Series Forecasting.
- Comprehensive Guide To Time Series Analysis Using ARIMA.
These articles will help you better understand time series analysis and how the different ARIMA-family models work with different kinds of time series data.
In this article, we will first read the data and perform the preprocessing steps. After that, we will discuss the ARIMA and SARIMAX models along with their implementation, and demonstrate how SARIMAX can be the better choice when seasonality and exogenous factors are present in the time series.
Code Implementation: Reading and Preprocessing the Data
To implement the models, I am using a dataset containing monthly alcohol sales from 1964 to 1972.
Here I am using Google Colab to run the code, so we first need to mount the drive to the notebook using the following command.
Input:
from google.colab import drive
drive.mount('/content/drive')
Importing the libraries.
Input:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Importing the data set.
Input:
data = pd.read_csv("/content/drive/MyDrive/Yugesh/SARIMAX/data.csv")
data.head()
Output:
Here we can see the data, where we have a Month column and a Sales column.
Right now, the values in the Month column are string objects. We need to convert them to DateTime objects and set the column as the index of the DataFrame. In the next step, I am changing the data accordingly.
Input:
data.index = pd.to_datetime(data['Month'])
data.drop(columns='Month', inplace=True)
data.head()
Output:
Let’s go for the basic data preprocessing with our data.
Checking for null values in the data.
Input:
data.isna().sum()
Output:
Here we can see that there are no null values in the data.
Describing the sales column.
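For example, a quick way to summarize the column, assuming it is named Sales as in the rest of the article:

data['Sales'].describe()

This prints the count, mean, standard deviation, minimum, quartiles, and maximum of the sales values.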
Making a line graph for visualization of the data.
Input:
data.plot()
Output:
Here, just from the visualization, we can see the seasonality in the dataset: the magnitude of sales changes in a repeating pattern, with almost similar changes at similar time intervals. More specifically, we can see that in the starting months of every year there is a sudden drop in sales compared with the closing months of the previous year.
A time series is made up of four kinds of components (trend, seasonality, cyclic variation, and the irregular/residual component), and they can affect our time series analysis if present in excess. So for this time series, we need to check further for the presence of these components.
Decomposing the time series.
Input:
from statsmodels.tsa.seasonal import seasonal_decompose

decompose_data = seasonal_decompose(data, model="additive")
decompose_data.plot();
Output:
Here we can see that the ranges of the trend and residual components are modest: the trend varies roughly between 4000 and 5000, and most of the time the residual stays within a narrow band. The seasonal component, however, varies between 0 and 5000, which is a much wider range.
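These ranges can also be checked numerically rather than read off the plot; the components of the statsmodels DecomposeResult can be inspected directly (a quick optional check, not part of the original walkthrough):

# numeric summary of each decomposition component
print(decompose_data.trend.dropna().describe())
print(decompose_data.seasonal.describe())
print(decompose_data.resid.dropna().describe())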
We can also extract and plot the seasonal component on its own for a clearer view of the seasonality.
Input:
seasonality = decompose_data.seasonal
seasonality.plot(color='green')
Output:
I think now we can easily see the seasonality effect in our time series. In the above image, we have extracted the seasonality from the time series.
To perform forecasting with the ARIMA model, we require a stationary time series. A stationary time series is one that is not driven by these components, so its statistical properties do not change over time. Most often, when the data is non-stationary, the predictions we get from the ARIMA model are poor or inaccurate.
If the data is not stationary, we have two options: make the data stationary by differencing, or use a model such as SARIMAX that can handle the seasonality itself.
To check the stationarity of a time series, we can perform the augmented Dickey-Fuller (ADF) test, a hypothesis test whose null hypothesis is that the series is non-stationary: if the p-value is less than 0.05, we can consider the time series stationary, and if the p-value is greater than 0.05, the time series is non-stationary.
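As an illustration of this decision rule, the test can be wrapped in a small helper (is_stationary and the 0.05 threshold are illustrative choices, not part of the original code):

from statsmodels.tsa.stattools import adfuller

def is_stationary(series, alpha=0.05):
    # null hypothesis of the ADF test: the series has a unit root (is non-stationary)
    p_value = adfuller(series.dropna(), autolag='AIC')[1]
    return p_value < alpha  # rejecting the null means we treat the series as stationary

# usage: is_stationary(data['Sales'])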
Performing the ADF test on the data.
Input:
from statsmodels.tsa.stattools import adfuller

dftest = adfuller(data.Sales, autolag='AIC')
print("1. ADF : ", dftest[0])
print("2. P-Value : ", dftest[1])
print("3. Num Of Lags : ", dftest[2])
print("4. Num Of Observations Used For ADF Regression and Critical Values Calculation :", dftest[3])
print("5. Critical Values :")
for key, val in dftest[4].items():
    print("\t", key, ": ", val)
Output:
Here we can see that the p-value for our dataset is greater than 0.05, so the evidence against the null hypothesis is weak; hence the time series is non-stationary. We can make the time series stationary with differencing methods. In this case, we go ahead with the rolling-mean differencing method. For more methods of differencing, you can refer to this article; it focuses mainly on deseasonalizing and differencing, and it also covers the ADF test.
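Besides rolling-mean differencing, two common alternatives are ordinary and seasonal differencing, which pandas provides out of the box (shown only for illustration; this article proceeds with the rolling mean):

# first-order differencing: y_t - y_(t-1)
simple_diff = data['Sales'].diff()

# seasonal differencing with a 12-month lag: y_t - y_(t-12)
seasonal_diff = data['Sales'].diff(12)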
Rolling-mean differencing is often used when the effect of seasonality in the data is strong.
Applying the rolling-mean differencing.
Input:
rolling_mean = data['Sales'].rolling(window=12).mean()
data['rolling_mean_diff'] = rolling_mean - rolling_mean.shift()

ax1 = plt.subplot()
data['rolling_mean_diff'].plot(title='after rolling mean & differencing');

ax2 = plt.subplot()
data.plot(title='original');
Output:
In the graphs, we can compare the series before and after differencing, and we can see that a lot of the seasonality has been removed. We can also run the ADF test again and compare the p-values.
Input:
dftest = adfuller(data['rolling_mean_diff'].dropna(), autolag='AIC')
print("1. ADF : ", dftest[0])
print("2. P-Value : ", dftest[1])
print("3. Num Of Lags : ", dftest[2])
print("4. Num Of Observations Used For ADF Regression and Critical Values Calculation :", dftest[3])
print("5. Critical Values :")
for key, val in dftest[4].items():
    print("\t", key, ": ", val)
Output:
We can see that the p-value is now close to zero, well below 0.05, so the differenced series is stationary. With all this preprocessing done, we can move on to the modelling side.
ARIMA
In statistics and time series analysis, an ARIMA (autoregressive integrated moving average) model is an extension of the ARMA (autoregressive moving average) model. ARMA consists of two main components, the autoregressive and moving average parts; ARIMA adds an integration (differencing) step between them. The ARIMA model is useful when the time series is non-stationary and differencing is required to make it stationary.
Mathematically, for a single order of differencing (d = 1), we can write the model as:

Δy_t = c + φ_1 Δy_(t-1) + … + φ_p Δy_(t-p) + θ_1 ε_(t-1) + … + θ_q ε_(t-q) + ε_t

Where
- c is the intercept of the ARMA model.
- Δ is the first-difference operator (Δy_t = y_t − y_(t-1)).
- y_(t-i) are the time lags of the series.
- φ_i and θ_j are the autoregressive and moving-average coefficients, and ε_t is the error (white-noise) term.
In the ARIMA model, we have to consider three values, which we also pass as parameters when implementing it. We represent them as (p, d, q).
- p = number of lags in the autoregressive model.
- d = differencing / integration order.
- q = number of moving-average lags.
So, more formally, ARIMA(1, 1, 1) means an ARIMA model of order (1, 1, 1), where the AR order is 1, the integration (differencing) order is 1, and the moving-average order is 1.
Our basic motive in this time series analysis is to use the ARIMA model to predict future values and compare it with the SARIMAX model. One of the important parts of time series analysis in Python is the statsmodels package, which provides most of the models and statistical tests under one roof; we have already used it several times earlier in this article.
Implementation of the model without differencing.
Importing the model.
Input:
# removed in statsmodels >= 0.13; newer versions use statsmodels.tsa.arima.model.ARIMA
from statsmodels.tsa.arima_model import ARIMA
Defining the model using the Sales column.
Input:
model = ARIMA(data['Sales'], order=(1, 1, 1))
history = model.fit()
Checking the summary of the model.
Input:
history.summary()
Output:
Testing of the model.
Input:
data['forecast'] = history.predict(start=90, end=103, dynamic=True)
data[['Sales', 'forecast']].plot(figsize=(12, 8))
Output:
Here we can easily see that the results we have got from the model are very unsatisfactory. This is because we have fit the model on a non-stationary time series; without stationary data, the model is not going to perform well.
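One quick way to see the poor fit directly is to look at the residuals of the fitted model, which should resemble white noise for a well-specified model (an optional check, not part of the original walkthrough):

# residuals of the ARIMA fit; clear structure here indicates a poor model
history.resid.plot(figsize=(12, 4))
plt.show()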
Next, we are going to apply the model to the data after differencing the time series.
Fitting and training the model.
Input:
model = ARIMA(data['rolling_mean_diff'].dropna(), order=(1, 1, 1))
model_fit = model.fit()
Testing the model.
Input:
data['forecast'] = model_fit.predict(start=90, end=103, dynamic=True)
data[['rolling_mean_diff', 'forecast']].plot(figsize=(12, 8))
Output:
Here we can see that the forecast lies approximately on the given data. In both cases we are making predictions on data that is already available, and on the differenced series the fit is quite satisfying. However, these predictions are for the differenced data, which means the values we forecast are also without the seasonality effect or the other components we removed. To get around this situation, we can use the SARIMAX model. So let's have an overview of SARIMAX.
SARIMAX
SARIMAX (Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors) is an extended version of the ARIMA model. ARIMA combines the autoregressive, integration, and moving average components, while SARIMAX additionally models seasonal effects and exogenous factors on top of them. Therefore, we can say SARIMAX is a seasonal counterpart of ARIMA, like SARIMA and Auto ARIMA.
Like the other seasonal models, SARIMAX captures the seasonal pattern, but it can also deal with external (exogenous) effects, and this is what sets it apart. For example, a temperature time series has seasonal effects: temperature is low in winter and high in summer. Still, external factors matter too: humidity can push the winter temperature up, and rain can bring the temperature down. We cannot predict the exact contribution of these factors if they do not follow a cyclic or seasonal pattern, and models without exogenous support are not capable of dealing with this kind of data.
In the SARIMAX model's parameters, we need to provide two kinds of orders. The first one is the (p, d, q) order, as in ARIMA, and the other specifies the seasonal effect; we call it the seasonal order, and it requires four numbers, as sketched below.
(Seasonal AR specification, Seasonal Integration order, Seasonal MA, Seasonal periodicity)
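Our dataset has no exogenous column, but to illustrate what the X in SARIMAX means: if we had an external driver (say a hypothetical promo column; the name is used only for illustration), it would be passed through the exog argument alongside both orders:

import statsmodels.api as sm

# 'promo' is a hypothetical exogenous column, used here only as an illustration
exog = data[['promo']]

exog_model = sm.tsa.statespace.SARIMAX(
    data['Sales'],
    exog=exog,                     # external regressors
    order=(1, 1, 1),               # non-seasonal (p, d, q)
    seasonal_order=(1, 1, 1, 12),  # seasonal (P, D, Q, s) with yearly periodicity
)
exog_results = exog_model.fit()

# forecasting would then also require future values of the exogenous variables:
# exog_results.predict(start=..., end=..., exog=future_exog)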
Mathematically, following the statsmodels formulation, the model can be written as:

y_t = β x_t + u_t

φ_p(L) Φ_P(L^s) Δ^d Δ_s^D u_t = θ_q(L) Θ_Q(L^s) ε_t

Where :
- x_t are the exogenous regressors and β their coefficients.
- φ_p(L) and Φ_P(L^s) are the non-seasonal and seasonal autoregressive polynomials in the lag operator L.
- θ_q(L) and Θ_Q(L^s) are the non-seasonal and seasonal moving-average polynomials.
- Δ^d and Δ_s^D denote non-seasonal and seasonal differencing, s is the seasonal periodicity, and ε_t is the error (white-noise) term.
Let’s see how we can implement this model on our dataset.
Importing the package where the model is available.
Input:
import statsmodels.api as sm
Fitting the model to the time series.
Input:
model = sm.tsa.statespace.SARIMAX(data['Sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()
Testing the fitted model.
Input:
data['forecast'] = results.predict(start=90, end=103, dynamic=True)
data[['Sales', 'forecast']].plot(figsize=(12, 8))
Output:
Here in the graph we can see the results: the forecast line lies almost on top of the given values for this model, and we did not even need to apply the differencing method. Using this model, we can now predict future values too.
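Before forecasting, it can be worth checking the fit more closely; the fitted SARIMAX results object exposes a summary and residual diagnostics (an optional check, not part of the original walkthrough):

# coefficient table and information criteria
print(results.summary())

# residual diagnostics: standardized residuals, histogram, Q-Q plot, correlogram
results.plot_diagnostics(figsize=(12, 8))
plt.show()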
Making a future dataset of NaN values.
Input:
from pandas.tseries.offsets import DateOffset

pred_date = [data.index[-1] + DateOffset(months=x) for x in range(0, 24)]
Giving it the same column names as our data.
Input:
pred_date = pd.DataFrame(index=pred_date[1:], columns=data.columns)
pred_date
Output:
To generate the forecast values, we need to concatenate this blank data with our alcohol sales data.
Input:
data = pd.concat([data, pred_date])
Making the prediction using the model we have created before.
Input:
data['forecast'] = results.predict(start=104, end=120, dynamic=True)
data[['Sales', 'forecast']].plot(figsize=(12, 8))
Output:
We can see that the model has predicted the values without compromising on the seasonality effects and exogenous factors, and the trend continues much as it did in past years.
In this article, we have seen how SARIMAX becomes useful when seasonality and exogenous factors are present in the time series. In real-world scenarios, many factors affect the behaviour of a time series, and in such situations it becomes difficult to predict accurately. Therefore, I encourage you to go deeper into the model and explore how its predictions can be made more accurate.