While our planet remains in lockdown due to the novel coronavirus (COVID-19), I decided to use some of my time to develop a machine learning model that forecasts the number of confirmed cases and deaths caused by the virus. Nevertheless, I pray to the Almighty to curtail these numbers to null or nan.
I got the live streaming data from Johns Hopkins University's GitHub repository.
Let us talk about the modelling part now.
For most countries, the time series plot was fairly simple: the number of cases was growing exponentially. So I applied the Holt-Winters method, which is used for exponential smoothing, setting the trend and smoothing level to "mul" and 1.0 respectively. Since the numbers were increasing exponentially, the prediction depended much more on recent data points than on distant ones, and in the Holt-Winters method the smoothing level (0 < alpha < 1) needs to tend to 1 if we want the forecast to depend strictly on the latest dates. In the figure below, the number of confirmed cases in Iran has been plotted.
The curve looks flat in the beginning, until the last week of February, but there is a sudden surge in numbers from the first week of March.
The Holt-Winters method yielded great results, with the Mean Absolute Percentage Error (MAPE) in the range of 4% to 10%. This approach worked well for countries with significant growth in the number of confirmed cases, such as Italy, Spain, Iran and the USA. But for countries like China, Singapore and South Korea, where the epidemic has been contained to a certain extent, the curve tended to saturate.
These cases required attention to the overall data rather than putting more weight on the latest points, so the hyperparameters had to be changed for the MAPE to stay in a safe range.
Since this involved a lot of manual intervention, we decided to build a robust model that handles the curve whether it is flat or exponential. Say Italy contains the spread of the coronavirus in the future: Holt-Winters will no longer yield good results with the same hyperparameters. I liked the fact that the Holt-Winters method is easy to implement for a particular case, but it is difficult to generalize it to every case.
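For reference, here is a minimal sketch of that exponential smoothing step. I am assuming statsmodels' ExponentialSmoothing class (not listed in the requirements below) and a pandas Series `confirmed` of daily cumulative counts:

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# multiplicative trend suits exponential growth; the series must be positive
model = ExponentialSmoothing(confirmed, trend='mul')

# fixing smoothing_level (alpha) at 1.0 makes the forecast depend
# almost entirely on the most recent observations
fitted = model.fit(smoothing_level=1.0)

forecast = fitted.forecast(7)  # predict the next 7 days
```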
And the above fact set the context of this article: "a robust LSTM time series model". Let us talk about the technical details now.
Requirements:
Jupyter notebook with Python 3 or above
pandas, numpy, sklearn, Keras, tensorflow
Step 1: Data Collection
Johns Hopkins University has been publishing time series data for confirmed, recovered and death cases every day for each country here. I will be working with confirmed cases for the time being to keep this article concise. For recovered and death cases you can replicate the same process.
Reading data for Iran and munging it to a proper time series format.
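The original shows this step as an image; below is a hedged reconstruction. The CSV path follows the JHU CSSE repository layout as I recall it, so treat the URL as an assumption:

```python
import pandas as pd

# assumed path of the JHU CSSE global confirmed-cases file
url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/'
       'time_series_covid19_confirmed_global.csv')

df = pd.read_csv(url)

# keep the Iran row and drop the non-date columns
iran = df[df['Country/Region'] == 'Iran'].drop(
    columns=['Province/State', 'Country/Region', 'Lat', 'Long'])

# transpose so that dates become the index: a univariate time series
series = iran.T.squeeze()
series.index = pd.to_datetime(series.index)
```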
Plot of confirmed, dead and recovered cases in Iran
Step 2: Data Pre-processing
Since there is enormous variation in the data, we will take small steps for prediction. Here we take 5 steps, i.e. the validation set consists of 5 data points and the rest is the training set. Data is available from Jan 22, so there are 66 days, or 66 data points, in total.
validation set = 5 data points
train = total - validation set = 66 - 5 = 61 data points
Since the data is heavily skewed (it starts from zero and goes up to thousands), we will normalize it (divide every value by the maximum value of the training set) based on the training set.
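A sketch of the split and normalization, using sklearn's MinMaxScaler, which is equivalent to dividing by the training maximum when the series starts at zero (the variable names are mine):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = series.values.astype('float64').reshape(-1, 1)
train, validation = values[:61], values[61:]  # 61 train + 5 validation points

# fit on the training set only, so no information leaks from the future
scaler = MinMaxScaler()
scaler.fit(train)
train_scaled = scaler.transform(train)
```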
Time Series Generator
TimeseriesGenerator is a utility class for generating batches of temporal data in Keras, i.e. producing batches for training/validation from a regular time series. These batches will be fed to the model during training.
In our case, we take 5 steps, i.e. 5 data points are taken into account to predict the 6th data point. So the batches would be [feed=[t1,t2,t3,t4,t5], predict=t6].
The flow would be:
x → Neural Network/update weights → y
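A sketch of the generator setup, reusing the hypothetical `train_scaled` array from the normalization step:

```python
from keras.preprocessing.sequence import TimeseriesGenerator

n_input = 5     # number of past days fed to the network
n_features = 1  # univariate series

# each batch pairs 5 consecutive scaled points with the 6th as the target
generator = TimeseriesGenerator(train_scaled, train_scaled,
                                length=n_input, batch_size=1)
```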
Step 3: Model Building
We are implementing the LSTM (Long Short-Term Memory) algorithm using Keras. Since a neural network is multi-layered (input → hidden layers → output layer), we use the Sequential class of the Keras library.
I am taking the number of neurons to be around 150. We can always find the optimum number of neurons with grid search and k-fold cross-validation, but as a preliminary approach we can use the following formula:

i = 3/2 × h × input_points

where h is the number of hidden layers, i is the number of input neurons and input_points is the number of training data points. We have 2 hidden layers and 1 output neuron, so i = 3/2 × 2 × 60 = 180, which I have rounded down to 150.
Initially, we had just one layer with 150 neurons, but to improve accuracy and model robustness we added one more layer with 75 neurons. So the structure of the model is: pass the input to an LSTM layer with 150 neurons → shrink the output to 75 neurons in a dense layer → one more dense layer that further shrinks the output to 1.
Activation functions are non-linear transformations that introduce non-linearity into the network. They are essential for an artificial neural network to learn complex, non-linear functional mappings between the inputs and the response variable. A neural network without activation functions would simply be a linear model, which has limited power and does not perform well most of the time. We want our neural network to learn not just a linear function but something more complicated than that.
We have chosen the optimizer to be "adam". Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data. It is appropriate for problems with very noisy or sparse gradients, is computationally efficient and requires little hyperparameter tuning.
Our loss function is the mean squared error (MSE), also called quadratic loss: the mean of the squared differences between the target values and the predicted values.
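Putting the last few paragraphs together, here is a minimal sketch of the network. The ReLU activations are my assumption, since the article does not name them explicitly:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# LSTM layer with 150 units reading 5 time steps of 1 feature
model.add(LSTM(150, activation='relu', input_shape=(n_input, n_features)))
# dense layer shrinking the representation to 75 neurons
model.add(Dense(75, activation='relu'))
# final dense layer producing the single forecast value
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
```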
Validation set
We have also introduced a validation set to calculate loss and MAPE. Recall that the validation set is equal in size to the step size, i.e. 5, while the batches are of the shape [(t1,t2,t3,t4,t5),(t6)], i.e. [(1,5,1),(1,1)]. So 6 data points are required in total, meaning one data point has to be taken from the training set, as shown in the sketch below.
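A sketch of how that 6-point window could be built, again with hypothetical variable names carried over from the earlier snippets:

```python
# last training point (index 60) plus the 5 held-out points: 6 in total,
# so the generator can pair one 5-step input with one target
validation_scaled = scaler.transform(values[60:])
validation_generator = TimeseriesGenerator(validation_scaled, validation_scaled,
                                           length=n_input, batch_size=1)
```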
Training the model
We use the early stopping mechanism of Keras, which stops training when a monitored quantity has stopped improving. In other words, if val_loss does not decrease over the upcoming epochs, training stops. We set the patience to 20, i.e. training continues for up to 20 more epochs without improvement before stopping. There is an important flag called "restore_best_weights" which restores the weights from the best iteration (where the validation loss is minimal).
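A sketch of the training call with early stopping; the epoch count of 100 is my assumption, since early stopping usually halts training well before the limit:

```python
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=20,
                           restore_best_weights=True)

model.fit(generator, validation_data=validation_generator,
          epochs=100, callbacks=[early_stop])
```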
Step 4: Model Performance
The model looks good, as the training loss and validation loss overlap at a minimal value. Also, both curves flatten out as the number of epochs increases.
If we look at both losses separately:
Both graphs follow each other after a couple of epochs of training and saturate at similar values afterwards. This confirms that the model is well trained.
Step 5: Forecast
We will forecast the number of confirmed cases in Iran for the validation set and for the next 7 days from today.
The output is normalized data, so we apply the inverse transformation to recover the actual values.
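A sketch of the rolling forecast and the inverse transform, reusing the hypothetical names from the earlier snippets:

```python
import numpy as np

n_forecast = 12  # 5 validation points plus 7 future days
predictions = []

# start from the last 5 scaled training points
current_batch = train_scaled[-n_input:].reshape(1, n_input, n_features)

for _ in range(n_forecast):
    pred = model.predict(current_batch)[0]
    predictions.append(pred)
    # slide the window: drop the oldest point, append the new prediction
    current_batch = np.append(current_batch[:, 1:, :], [[pred]], axis=1)

# undo the normalization to get actual case counts
predictions = scaler.inverse_transform(np.array(predictions))
```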
Restructuring the array to a readable pandas dataframe.
Plot the curve for original and predicted data
Mean Absolute Percentage Error (MAPE)
And the accuracy would be 100 - MAPE = ~93%.
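For completeness, a small helper that computes MAPE over the validation window; it assumes the actual values are non-zero:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs((actual - predicted) / actual)) * 100
```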
Calculation of the prediction interval (95% confidence level, CL)
For a 95% CL, the t-multiplier is approximately 1.96; the exact value depends on the degrees of freedom of the sample and the required CL.
t-multiplier * standard_deviation gives the magnitude of the interval,
and the min and max of the range are given by:
min = value - interval
max = value + interval
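A sketch of the interval calculation, assuming the validation residuals approximate the forecast error (`actual_validation` and `predicted_validation` are hypothetical names):

```python
import numpy as np

residuals = actual_validation - predicted_validation
interval = 1.96 * np.std(residuals)  # multiplier for a 95% interval

lower_bound = predictions - interval
upper_bound = predictions + interval
```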
Conclusion
The above approach can be replicated for the number of death and recovered cases for each country; we just have to run the model in a loop and get the predictions.
However, I read an article in The Washington Post suggesting that temperature and humidity may slow down the spread of the virus, which may explain why the spread remains relatively low in countries like Cambodia, Laos and Vietnam compared to Europe or the US.
So next time I would probably try to model a multivariate time series with variables like temperature, humidity and population density.
Feel free to ask in the comment section in case of any suggestions/queries.
I would like to thank Jose Portilla for his amazing Tutorial on Time Series on Udemy.