Prediction of COVID-19 Confirmed Cases after Vaccination: Based on Statistical and Deep Learning Models

In this paper, we analyze and predict the number of daily confirmed cases of coronavirus disease (COVID-19) using two statistical models and a deep learning (DL) model: the autoregressive integrated moving average (ARIMA), the generalized autoregressive conditional heteroscedasticity (GARCH), and the stacked long short-term memory deep neural network (LSTM DNN). We find the orders of the statistical models from the autocorrelation function and the partial autocorrelation function, and the hyperparameters of the DL model, such as the number of LSTM cells and the number of blocks in a cell, by exhaustive search. Ten datasets are used in the experiment: those of nine countries and the world, from Dec. 31, 2019, to Feb. 22, 2021, provided by the WHO. We investigate the effects of data size and vaccination on performance. Numerical results show that performance depends on the date range of the data used and on vaccination. They also show that the LSTM DNN predicts better than the two statistical models. Based on the experimental results, the percentage improvements of the LSTM DNN are up to 88.54% (86.63%) and 90.15% (87.74%) over ARIMA and GARCH, respectively, in mean absolute error (root mean squared error), while the relative performance of ARIMA and GARCH varies with the dataset. The obtained results may provide a criterion for the performance ranges and prediction accuracy of the COVID-19 daily confirmed cases.


Introduction
The coronavirus outbreak that began in Wuhan, China, in December 2019, and was named COVID-19 by the World Health Organization (WHO), made 2020 a year of global disaster [1]. Since the first death from the disease was reported in January 2020, the numbers of confirmed cases and deaths increased continuously until vaccination began on Dec. 8, 2020, in the United Kingdom (UK). Currently, the numbers of confirmed cases are declining in countries where vaccinations have begun, such as the USA and the UK, while the numbers are still increasing or fluctuating in other countries where vaccinations started late or have not yet begun.
Symptomatic treatment and supportive therapy are used to treat COVID-19 patients. These include basic disease treatment, symptom relief, effective protective and supportive treatment of internal organs, active prevention and treatment of complications, and respiratory support if necessary. Researchers are working on the development of treatments for the disease, and countries are supporting this work. As a result, several types of vaccines have been developed. However, drugs that can cure the disease have not yet been developed.

After the outbreak of the disease, each country has done its best to protect its people. For example, various policies are in place, such as limiting gatherings, restricting overseas travel, and quarantining arrivals from abroad to prevent the influx of the virus. Nevertheless, the trend, the rate of increase, and the number of confirmed cases vary from country to country depending on several factors, such as culture, policy, health care, and social habits. Imran et al. (2020) [3] analyzed the reactions of people from different cultures to COVID-19 and the sentiment about their subsequent actions by applying a deep long short-term memory (LSTM) model to extracted tweets. The damage caused by the disease is expected to appear worldwide over the next few years in all fields, including the economy, society, culture, and the emotions of individuals.
In this paper, we analyze COVID-19 based on the number of daily confirmed cases. As the data is a time series, we consider time series models from statistics and deep learning (DL) to predict the number of daily confirmed cases: ARIMA, generalized autoregressive conditional heteroscedasticity (GARCH), and the stacked LSTM deep neural network (LSTM DNN). The prediction procedure of each model consists of three parts: preprocessing, training, and prediction. The min-max transformation is used to preprocess the datasets. The autocorrelation function (ACF) and the partial autocorrelation function (PACF) are used to find the orders of the statistical models, while the sub-optimal hyperparameters of the DL model, such as the number of LSTM cells and the number of blocks in an LSTM cell, are found by exhaustive search. Data from Dec. 31, 2019, to Feb. 22, 2021, provided by the WHO [2], is used in the experiment. The models are applied to ten datasets of daily confirmed cases: those of nine countries across the continents and the world. Datasets of two sizes are used in the experiment to investigate the effects of data size and vaccination on performance. Numerical results show these effects and show that the stacked LSTM DNN outperforms the statistical models.
The motivating questions are as follows: 1) Can the statistical models and the DL techniques provide acceptable predictive performance for the new disease before and after vaccination? 2) Which model predicts the disease best? Several articles have considered predicting the disease using statistical models and ML models (e.g., [15,20-23]). Since these studies used short-term data of the disease, the data was insufficient for learning. The number of confirmed cases continues to increase as the disease spreads; however, it tends to decline in countries where vaccination has begun. Therefore, it is meaningful to apply the models to larger datasets and investigate the effect of vaccination on performance. Besides, this is the first study to apply the GARCH model to a COVID-19 dataset.
The rest of this paper is organized as follows: Section 2 describes the predictive models and procedures for daily confirmed cases. Section 3 and Section 4 present the performance measures and the experimental results, respectively, and Section 5 provides a conclusion.
Many studies in 2020 investigated models for COVID-19. The methods used include mathematical models [7-19], statistical models [20-22], and artificial intelligence (AI) models [15,22-24]. Ndairou et al. (2020) [7] proposed a compartmental mathematical model for the spread of the disease, emphasizing the transmission potential of super-spreaders. The study examined the threshold of the reproduction number, the local stability of the disease-free equilibrium in terms of that number, and the sensitivity of the model to its parameters. Kucharski et al. (2020) [8] considered a stochastic transmission model to estimate the variation of transmission over time. Based on the estimate, they calculated the probability that newly confirmed cases generate outbreaks in other areas. Zhao et al. (2020) [9] estimated the reproduction number in the early stage of the disease from the curve of confirmed cases in China. Shen et al. (2020) [10] also estimated the reproduction number through a dynamic model based on Chinese data, from which the epidemic peak time and size were predicted. Cakan (2020) [11] proposed a new epidemic model that can explain the impact of health care capacity and studied its local and global stability. Wu et al. (2020) [12] introduced the susceptible-exposed-infectious-recovered (SEIR) model to simulate the Wuhan epidemic and used it to estimate the spread of the disease nationally as well as globally. Shah et al. (2020) [13] proposed a generalized SEIR model for the disease, in which the transmission behavior was investigated under different control strategies. In the model, the authors considered transmission between humans and formulated the reproduction number to analyze the transmission dynamics of the coronavirus outbreak. Intissar (2020) [14] reinvestigated the SEIR model of Shah et al. (2020) [13], considering the local and global stability conditions by using a reproduction number and adding some control parameters to force the trajectories to go to the equilibria in the five-dimensional COVID-19 system. Zheng et al. (2020) [15] proposed an improved susceptible-infected model to estimate the variation of infection rates for analyzing the transmission laws and development trend. The model contains a natural language processing (NLP) module and the LSTM. Fanelli & Piazza (2020) [16] analyzed the temporal dynamics of the disease outbreaks in China, Italy, and France. The study indicated the universality of epidemic spreading based on the analysis of simple day-lag maps and proposed simple mean-field models to obtain a quantitative picture of the epidemic spreading. Choi & Ki (2020) [17] considered a transmission model, the reproduction number, and the effectiveness of preventive measures fitted to S. Korea using its number of confirmed cases. Ivorra et al. (2020) [18] proposed a mathematical model of the disease spread and investigated the detected portion among all infected cases. Chen et al. (2020) [19] developed a simplified transmission network model for the disease by simulating the potential transmission from the infection source to human infection, and then computed the reproduction number based on the model. Roy et al. (2020) [20] predicted epidemiological patterns of prevalence and incidence of the disease with ARIMA, using cumulative confirmed cases in Indian states. Singh et al. (2020) [21] proposed a hybrid methodology, wavelet-autoregressive integrated moving average (W-ARIMA). They used the number of daily deaths in five countries to validate their method, made one-month-ahead predictions of death cases, and compared its performance with ARIMA. Singh et al. (2020) [22] considered ARIMA and the least squares support vector machine (LS-SVM) to predict confirmed cases. Daily confirmed cases of SARS-CoV-2 in the five most affected countries were used for modeling and for predicting one month of confirmed cases. Shahid et al. (2020) [23] predicted confirmed cases, deaths, and recoveries through ARIMA, support vector regression (SVR), LSTM, and bidirectional LSTM, using datasets of ten countries. For the early-stage treatment of the disease, the analysis of chest X-rays of infected patients was a crucial step. Waheedi et al. (2020) [24] developed a model based on an Auxiliary Classifier Generative Adversarial Network (ACGAN) to generate image data.
There are several survey articles on the disease [25-27]. Latif et al. (2020) [25] surveyed various research activities on the disease, including statistical and artificial intelligence (AI) modeling and data visualization, which can be used in data management tasks such as storing, processing, training, predicting, and extracting insight. Emphasizing the importance of responding to the COVID-19 outbreak and preventing the severe effects of the pandemic, Pham et al. (2020) [26] gave an overview of AI and big data in various areas, identified applications aimed at fighting COVID-19, highlighted challenges and issues associated with state-of-the-art solutions, and made recommendations for effective control of the COVID-19 situation. Chamola et al. (2020) [27] investigated the key aspects of the disease, focusing on its impact on the global economy, and considered the use of technologies, including the internet of things (IoT) and AI, to mitigate the outbreak.

Materials and Methods
Since the number of daily confirmed cases of COVID-19 is a time series, we denote it as a process $\{X_t\}_t$. For the analysis of time series, statistical models, such as ARIMA and GARCH, have traditionally been considered, and NN models, such as the multi-layer perceptron (MLP) and the LSTM recurrent neural network (RNN), have been used recently (e.g., Kim (2020 and 2021) [28,29]). In this study, we consider two statistical models and one DL model. We define terms and explain the models and their prediction procedures in this section.
Firstly, we define the n-step ahead prediction as the conditional expectation of $X_{t+n}$ given that $\mathcal{F}_t$ is known. Here, $\mathcal{F}_t$ is the entire history up to time t generated by $\{X_s : s \le t\}$. [...] distribution with mean zero and variance one, for all t. If the process $\nabla^d X_t$ is GARCH(p,q), we call it [...]
The statistical methods dealing with time series require testing the stationarity of the data in advance. If the given data is non-stationary, the stationarity test is applied to the first-differenced dataset, $\{\nabla X_t\}$. The test is repeated with increasing d until a stationary process is obtained. Once the test passes, we have to find the orders of the model for the process. ARIMA and GARCH generally use the ACF and the PACF to find the orders. The determined orders are used in the training process, and the time series is predicted based on the trained results.
We explain Algorithm 1 based on the dataset of daily confirmed cases. Figure 1 shows the number of daily confirmed cases of the world and nine countries over 420 days from Dec. 31, 2019, to Feb. 22, 2021. Then, $X_t$ is the number of confirmed cases at time t and $N = 420$ (step 1). In China, $X_t$ increased suddenly at the beginning of the outbreak and then declined about sixty days after the first outbreak. $X_t$ for the other countries tended to increase with fluctuations. After vaccination began in the UK and the USA, $X_t$ for both countries tended to decline. All the datasets considered were found to be non-stationary, and the datasets consisting of the first or the second differences of daily confirmed cases turned out to be stationary (step 1). Figure 2 illustrates the first difference $\nabla X_t$ (step 1) of the world dataset, while Figure 3 illustrates its ACF and PACF with 60 lags (step 2). Based on Figure 3, [...]
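As an illustration of steps 1 and 2, the differencing and the sample ACF and PACF can be sketched in plain Python. This is a minimal sketch, not the code used in the experiments: the function names `difference`, `acf`, and `pacf` are hypothetical, the PACF is computed with the Durbin-Levinson recursion (one common choice, assumed here), and the stationarity test itself is left out because the text does not name it.

```python
from typing import List


def difference(series: List[float], d: int = 1) -> List[float]:
    """Apply the first-difference operator d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def acf(series: List[float], max_lag: int) -> List[float]:
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def pacf(series: List[float], max_lag: int) -> List[float]:
    """Sample partial autocorrelation (Durbin-Levinson recursion, assumed)."""
    rho = acf(series, max_lag)
    result = [1.0]
    phi_prev: List[float] = []  # coefficients of the order k-1 fit
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_curr = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ]
        phi_curr.append(phi_kk)
        result.append(phi_kk)
        phi_prev = phi_curr
    return result
```

In this sketch, `difference(x, d)` would be called with d = 1 or d = 2 before inspecting `acf` and `pacf` up to 60 lags, mirroring Figures 2 and 3.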

Predictive Models and Prediction Procedures of COVID-19: Stacked LSTM DNN
LSTM is a model designed to address the vanishing gradient problem of RNNs, which deal with time series. Figure 4 illustrates the structure of an LSTM cell.
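To make the data flow through an LSTM cell concrete, the following sketch runs a sequence through one or more single-unit cells, chained so that the hidden output of one cell feeds the next. It is an illustration under stated assumptions: it uses the standard textbook gate equations (forget, input, and output gates with a tanh candidate), scalar inputs and states (one block per cell), and the hypothetical names `lstm_step` and `run_stacked`. It does not reproduce Figure 4, the output layer, or the trained weights of the paper.

```python
import math
from typing import Dict, List, Tuple

# Each gate holds (input weight, recurrent weight, bias); names are assumed.
Params = Dict[str, Tuple[float, float, float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float,
              params: Params) -> Tuple[float, float]:
    """One time step of a single-unit LSTM cell (scalar input and state)."""
    pre = {}
    for name in ("f", "i", "o", "g"):
        w, u, b = params[name]
        pre[name] = w * x + u * h_prev + b
    f = sigmoid(pre["f"])    # forget gate
    i = sigmoid(pre["i"])    # input gate
    o = sigmoid(pre["o"])    # output gate
    g = math.tanh(pre["g"])  # candidate cell value
    c = f * c_prev + i * g   # new cell state
    h = o * math.tanh(c)     # new hidden state
    return h, c


def run_stacked(sequence: List[float], layers: List[Params]) -> float:
    """Feed a sequence through chained cells; return the top hidden state."""
    states = [(0.0, 0.0) for _ in layers]
    top_h = 0.0
    for x in sequence:
        inp = x
        for j, params in enumerate(layers):
            h, c = lstm_step(inp, states[j][0], states[j][1], params)
            states[j] = (h, c)
            inp = h  # hidden output of cell j is the input of cell j+1
        top_h = inp
    return top_h
```

A cell with several blocks would replace the scalars by vectors and the weights by matrices; the scalar form is kept here only for readability.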
LSTM cells can be stacked in a network to enhance prediction accuracy. Figure 5 illustrates the structure of a stacked LSTM DNN. All of the LSTM cells in Figure 5 have the same structure as shown in Figure 4. The dataset is input to LSTM(1), the output of LSTM(1), $h_1$, is passed to LSTM(2), and the same procedure is performed up to the kth LSTM stack. At the end of the last LSTM cell, $h_k$ is converted to the output via softmax. Algorithm 2 explains the prediction procedure of the LSTM DNN. In step 1, the dataset is pre-processed by the min-max transformation (MMT) defined by:
$$X_t \mapsto \frac{X_t - \min_s X_s}{\max_s X_s - \min_s X_s}. \tag{5}$$
The mean squared error is used as the cost function in step 3. Table 1 summarizes the notation.
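The min-max transformation and its inverse (needed to map predictions back to case counts) can be sketched as follows. The function names are hypothetical, and fitting the minimum and maximum on the training portion only is an assumption of this sketch; the text does not say which portion of the data determines them.

```python
from typing import List, Tuple


def min_max_fit(train: List[float]) -> Tuple[float, float]:
    """Return the (min, max) used for scaling; assumed to come from training data."""
    lo, hi = min(train), max(train)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max scaled")
    return lo, hi


def min_max_transform(values: List[float], lo: float, hi: float) -> List[float]:
    """Scale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


def min_max_inverse(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]
```

Values outside the fitted range (for example, a new record number of daily cases in the test period) map outside [0, 1] under this choice, which is one reason the fitting portion matters.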

Performance Measures
To measure the accuracy of prediction, we consider the mean absolute error (MAE), the root mean squared error (RMSE), the normalized mean absolute error (NMAE), and the normalized mean squared error (NMSE).
We define another measure, percentage improvement (PI), to compare the performance of two models.
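For reference, MAE and RMSE in their standard forms, together with one natural reading of PI, can be computed as below. This is a hedged sketch: the PI formula (relative reduction of the baseline error, in percent) is an assumption consistent with how improvements are reported in this paper, the function names are hypothetical, and NMAE and NMSE are omitted because their normalization is not recoverable from the text.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error (standard definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error (standard definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


def percentage_improvement(err_base: float, err_new: float) -> float:
    """Assumed PI: how much err_new reduces err_base, in percent."""
    if err_base == 0:
        raise ValueError("baseline error must be non-zero")
    return (err_base - err_new) / err_base * 100.0
```

Under this reading, a baseline MAE of 10 and a new MAE of 2 give a PI of about 80%, and a negative PI means the second model is worse than the baseline.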

Experimental Setting
We conducted the experiments using Python 3.6 and TensorFlow v.1.7.0 on an Intel Core i7 with 16 GB RAM. The data used is the daily confirmed cases of COVID-19 for 420 days from Dec. 31, 2019, to Feb. 22, 2021, extracted from the dataset obtained through GitHub [32], provided by the WHO [2]. The datasets of the world and nine countries shown in Figure 1 are selected. The selection was intended to include countries from all continents, where culture and policies on the disease differ. The countries are Argentina, Australia, China, Egypt, Germany, India, S. Korea, the UK, and the USA, and the world dataset is used to investigate the global trend of the disease. All three models are applied to each dataset. Besides the ten datasets, we consider a portion of each dataset to investigate the effect of data size on performance. The selected portion covers 247 days from Dec. 31, 2019, to Sept. 2, 2020. From now on, we call this a small-size dataset. We observed that performance depends on the training ratio $r$ and that $r = 0.8 \sim 0.9$ provides better performance than other ratios. Therefore, we mainly use $r = 0.8$ in the experiments and use $r = 0.9$ only for comparing the effect of the ratio. We use the sigmoid activation function, 100 epochs, and a batch size of one in the LSTM cells. Since predictions a small number of steps ahead are more accurate, we consider 1-step ahead prediction in the experiment.
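The training ratio and the 1-step ahead setting can be illustrated with a short sketch. The chronological split and the look-back window length (`window`) are assumptions of this example, and the names `split_by_ratio` and `one_step_pairs` are hypothetical; the text specifies only r and the 1-step horizon.

```python
from typing import List, Sequence, Tuple


def split_by_ratio(series: Sequence[float],
                   r: float = 0.8) -> Tuple[List[float], List[float]]:
    """Chronological split: the first r fraction trains, the rest tests."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    cut = int(len(series) * r)
    return list(series[:cut]), list(series[cut:])


def one_step_pairs(series: Sequence[float],
                   window: int = 1) -> List[Tuple[List[float], float]]:
    """Build (past window, next value) pairs for 1-step ahead prediction."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        (list(series[t - window:t]), series[t])
        for t in range(window, len(series))
    ]
```

With the 420-day datasets and r = 0.8, this split would give 336 training days and 84 test days.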

Reports of Various Experimental Results
The orders of the statistical models are determined by the ACF and the PACF, while the optimal hyperparameters of the LSTM DNN, such as the number of LSTM cells and the number of blocks in a cell, are obtained through exhaustive search. Let (p,d,q) and (m1,m2) be the obtained orders of the statistical models and the hyperparameters of the LSTM, respectively, which minimize NMAE and NMSE or MAE and RMSE. Here, m1 and m2 are the number of LSTM cells and the number of blocks in an LSTM cell, respectively. From now on, we call (p,d,q) and (m1,m2) the optimal parameters. In the experiment, we considered up to five LSTM cells and fifteen blocks in an LSTM cell. We observed that one LSTM cell provides the best NMAE, except for three countries. Figure 6 shows the values of NMAE and NMSE for varying numbers of blocks with one LSTM cell for the two different sizes of datasets. Figures 6(a) and 6(b) are obtained from the small-size datasets and the full 420-day datasets, respectively. The fluctuations of NMAE and NMSE are large for the small-size datasets, which seems to be due to the greater sensitivity to hyperparameters for small-size datasets. On the other hand, the values of NMAE and NMSE for the small-size datasets are less than those for the large-size datasets. This seems to be due to the relatively small changes in the predicted values of the small-size datasets, since the increments of confirmed cases in the small-size datasets are small compared to those in the full datasets.
Figure 7 shows the daily confirmed cases of four datasets and their predictions by the three models using the optimal parameters. Black and red lines represent raw data and predictions, respectively. In China, many confirmed cases occurred in the early stage of the outbreak, and those seem to affect the model, which appears as the difference between predicted and actual values. Table 3 shows this explicitly. The increase in confirmed cases in S. Korea was limited in the early stage of the disease. This is due to the policies that identify and disclose the paths of infected people and mandate mask-wearing. However, the number started to increase due to an unexpected public gathering held on August 15, 2020, and the relaxation of the distancing policy from the second stage to the first stage from Oct. 12. The change in the distancing policy resulted in group infections through nursing hospitals and church meetings. The number of daily confirmed cases in the USA soared in the early stage of the disease, which seems to be related to the culture regarding mask-wearing. The number increased from 646.2 per 1 million people on Dec. 15, 2020, when vaccination began, to 753.3 on Jan. 11, 2021. Since then, it has decreased with slight fluctuations. According to the figure, the LSTM DNN is better than the statistical models for the datasets.
Table 2 presents the NMAE and NMSE corresponding to Figure 7 and those of the other countries. The value in the 'Optimal' column is the optimal parameter for the dataset. For example, for China, corresponding to Figure 7, the NMAE and NMSE of the LSTM DNN are 0.2468 and 0.3190, respectively, and (1,1,1), (4,2,1), and (2,4) are the optimal parameters that provide the best NMAE for ARIMA, GARCH, and LSTM DNN, respectively. We notice that one LSTM cell, m1=1, is optimal for the datasets, except for three countries. The number of blocks in a cell varies depending on the dataset. Table 3 presents MAE and RMSE. We observed that the (p,d,q) that minimizes NMAE may not minimize MAE. The optimal hyperparameters of the LSTM DNN for MAE are (2,2), different from (2,4) for NMAE.
Table 4 presents the performance improvements between models, obtained from Table 3. It shows that the LSTM DNN improves on ARIMA and GARCH by 3.36%~88.54% (9.05%~86.63%) and 12.22%~90.15% (14.15%~87.74%), respectively, in MAE (RMSE), while which of ARIMA and GARCH is better depends on the data. For example, for the data of Egypt, the LSTM DNN improves on ARIMA and GARCH by 88.54% (86.63%) and 90.15% (87.74%) in MAE (RMSE), respectively, and GARCH is better than ARIMA for the data of the UK and the USA.
The datasets used in Singh et al. (2020) [21,22] are the number of daily death cases from Jan. 21 to April 11, 2020, and the number of daily confirmed cases from May 10 to June 7, 2020, respectively. Singh et al. (2020) [21] considered ARIMA and wavelet-ARIMA (W-ARIMA), while Singh et al. (2020) [22] considered ARIMA and LS-SVM, and both used a training ratio of 0.8. Both studies obtained the orders of ARIMA in the same way as ours. The datasets used in Shahid et al. (2020) [23] are the numbers of daily confirmed cases, deaths, and recoveries from Jan. 22 to May 10, 2020. The study considered three models, ARIMA, SVR, and LSTM, with a training ratio of 0.7. In the ARIMA model, (p,d,q)=(1,1,1) was used. The values in the table are the results for the daily confirmed cases.
Direct comparisons with these studies are difficult because the data and training ratios used in the experiments differ. However, we can use the existing measurement ranges as a criterion for our results. Since the existing results were obtained with short-term data before vaccination, we also used a similar small-size dataset. Besides, we considered different training ratios for the small-size data to investigate their effect on performance. For the small-size dataset, the optimally obtained orders and hyperparameters are used, which are (1,1,1) ((1,2,0)) for ARIMA and GARCH and (1,8) ((1,4)) for the LSTM DNN for the data of the UK (USA) for both training ratios. The table shows that MAE and RMSE for the small-size dataset are much less than those for the large dataset. This seems to be because the numbers of daily confirmed cases in the small-size dataset are less than those in the large dataset. That is, training with a small-size dataset is suitable for predicting small numbers of confirmed cases, whereas a training set that contains more small numbers of confirmed cases than large ones seems unsuitable for predicting large numbers of confirmed cases. Based on this observation, one might expect that a larger dataset yields worse results. However, this cannot be assumed in general, as the optimal parameters may vary depending on the dataset.
Besides, vaccination seems to affect performance. It is noteworthy that the results with the full dataset are better than those of Singh et al. (2020) [22]. Several reasons can be considered, such as vaccination and different orders and training ratios.
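The exhaustive search over (m1,m2) described in this section can be sketched as a plain grid search. The `evaluate` callback is hypothetical: it stands for training an LSTM DNN with m1 cells and m2 blocks and returning an error such as NMAE. Only the grid bounds (five cells, fifteen blocks) come from the text.

```python
from itertools import product
from typing import Callable, Dict, Tuple

Grid = Tuple[int, int]


def exhaustive_search(
    evaluate: Callable[[int, int], float],
    max_cells: int = 5,
    max_blocks: int = 15,
) -> Tuple[Grid, Dict[Grid, float]]:
    """Score every (m1, m2) pair and return the pair with the lowest error."""
    scores: Dict[Grid, float] = {}
    for m1, m2 in product(range(1, max_cells + 1), range(1, max_blocks + 1)):
        scores[(m1, m2)] = evaluate(m1, m2)
    best = min(scores, key=lambda pair: scores[pair])
    return best, scores
```

With the bounds above, the search trains 75 configurations per dataset and per error measure, which is why the optimum for NMAE can differ from the optimum for MAE.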

Conclusion
The numbers of daily confirmed cases of COVID-19 are analyzed and predicted by three models: ARIMA, GARCH, and the stacked LSTM DNN. Datasets of two sizes are used in the experiment to investigate the effects of data size and vaccination on performance. Experimental results show that the LSTM DNN predicts best for all datasets in terms of MAE (NMAE) and RMSE (NMSE), while the performance of ARIMA and GARCH depends on the dataset. The NN with one LSTM cell outperformed the NNs with more LSTM cells in many cases, which should be investigated further. We will extend this study to other NN models, including GAN and meta-learning techniques, and to datasets with more components, such as the number of daily deaths from the disease. The proposed method can also be applied to image data of the disease, such as chest X-rays of patients.

Funding
This work was supported by the Mid-career Research Program through the NRF Grant funded by the Korea government (MEST) (NRF-2019R1A2C1002706).

Ethical Approval
All procedures performed in studies involving human participants were in accordance with the Italian National Research Council (CNR) and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Data Availability Statement
The data presented in this study are available in the article.

Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.