"US could see a very deadly December with tens of thousands of coronavirus death to come, computer model predicts" , "The COVID 19 CRISIS : 300,000 Deaths By December?", "A model predicts COVID 19 death toll will double" … This kind of headline has continued to appear in the international press since the start of the health crisis linked to COVID 19. Predictions of the number of cases or deaths are multiplying and differing between them. How many of us find ourselves skeptical of these many predictions and doubt the numbers.
We propose, throughout this data story, a method that would improve those predictions by including the information present on Google Trends. We are focusing this study around 3 main indicators: Research concerning the symptoms of COVID-19, those concerning social life through outings to restaurants for example, and research related to activities and tutorials on how to make a mask for instance. We propose to compare the contribution of each request in order to lead to an improvement of the prediction. Let’s dive with us in this through our operations during the COVID 19 pandemic and find out what the data shows as changes in our lifestyle.
In order to perform a data analysis of COVID-19 deaths and Google Trends, we need to make several hypotheses. First, an epidemic spreads within a population and therefore epidemic peaks occur at different times around the world. Moreover, we have observed in 2020 the unilateral political measures taken to counter the health crisis. This is why focusing on a single country is important in order to have a homogeneous population subject to the same restrictions at the same time. It is, therefore, more interesting to focus our analysis on a small region: France. Moreover, we focus here only on the mortality of COVID-19 and not on the cases. Indeed, covid positive cases depend on the testing capacity and therefore are less representative than the covid deaths of the different epidemic waves.
Let’s first focus on the COVID-19 deaths data. We downloaded the data from the EU open data portal and then preprocessed them.
Now that we have the COVID-19 deaths, let’s extract the Google Trends data: the keywords, or Google requests. Which keywords do we choose? We decided to select different categories of Google requests. The main idea of a category is to group similar keywords so that they can be compared with one another. For instance, we want to compare different symptoms and their relevance to the COVID-19 deaths. We present the different categories and associated keywords in the following table.
Symptoms | Social | How to? |
---|---|---|
fievre | blablacar | savoir si on a le coronavirus |
toux seche | taxi | mettre un masque |
fatigue | house party | calculer son IMC |
courbatures | Uber | faire du pain |
maux de gorge | top cafe | calculer distance 100 km |
perte du gout | top bar | faire un masque sans machine |
perte odorat | plato | se transmet le coronavirus |
essoufflement | top restaurant | fabriquer un gel hydroalcoolique |
diarrhee | Comparateur vol | se transmet le coronavirus |
maux de tete | Couvre-feu | mettre un masque chirurgical |
How did we select the different categories? Here are some insights:
Symptoms: Google keywords related to COVID symptoms such as cough, fever, loss of taste, etc. These keywords reflect the current concerns of Google users about COVID-19 symptoms and can give a more precise idea of the evolution of the virus.
Social: Google keywords related to social activities such as top restaurant, top bar, etc. These keywords give some details on the various social "movements". The idea is to look for a link between these indicators and the evolution of COVID-19. This is particularly interesting given the many restrictions on social life implemented by governments.
How to: Google keywords taken from the top 10 "how to" questions of 2020 across different fields. Through these keywords, we want to see whether there is a link between, for example, the increase in questions about how to make a mask and the evolution of the virus.
Now that we have our Google requests dataset, let’s plot it. You can show or hide each keyword by clicking on its label in the legend.
Cross-correlation Analysis
Are COVID deaths and Google Trends related?
Now that we have the COVID-19 deaths data and the Google Trends data for our different categories, we can compare them. In this section, we perform a Spearman cross-correlation analysis between the COVID-19 deaths and the Google Trends keywords of each category. Let’s see what it looks like.
The most specific symptoms of COVID-19 are the most correlated with the death data. This is an expected result: the more the virus spreads, the more people tend to look up its symptoms.
In general, social keywords tend to have a high correlation compared to the 2 other categories. This is particularly interesting because it suggests a strong relationship between the evolution of the virus and changes in social life.
The same conclusions can be drawn for the how-to category: it follows the evolution of the spread of the virus.
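To make the idea of a Spearman cross-correlation concrete, here is a minimal sketch in plain Python. It is not the code behind the plots of this story: the function names, the lag convention (a positive lag means the searches come before the deaths) and the two synthetic series are assumptions made for illustration only.

```python
from math import exp, sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_cross_correlation(trend, deaths, max_lag):
    """Map each lag to Spearman's rho between trend[t] and deaths[t + lag]."""
    n = len(trend)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = trend[: n - lag], deaths[lag:]
        else:
            a, b = trend[-lag:], deaths[: n + lag]
        result[lag] = spearman(a, b)
    return result


# Synthetic series (NOT real data): the death curve repeats the search
# curve two time steps later.
steps = range(40)
trend = [exp(-((t - 15) ** 2) / 20) for t in steps]
deaths = [exp(-((t - 17) ** 2) / 20) for t in steps]

rho_by_lag = spearman_cross_correlation(trend, deaths, max_lag=4)
best_lag = max(rho_by_lag, key=rho_by_lag.get)
print(best_lag, round(rho_by_lag[best_lag], 3))
```

Because Spearman's coefficient works on ranks, it captures any monotonic relationship between searches and deaths, not only a linear one. If SciPy is available, `scipy.stats.spearmanr` can replace the hand-written `spearman` helper.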
Statistical Significance
Now that the different features have been selected and the first step of the study has been carried out, we propose a second approach to look at the relationship differently. The idea is to fit a linear autoregressive model, AR(1), on the data. The model takes the Google Trends index of each keyword as an exogenous variable.
Let’s fit our model on the data and extract the relevant features. The idea is to gain insight into which features are significant in the prediction model. The metric used is the p-value of a statistical test on each feature: a p-value below the significance level of 0.05 means that there is a significant relationship between the feature and the death data. This procedure is applied to the 3 categories, and the results for the most interesting requests per category are shown in the following visualizations.
We notice that 3 requests are significantly linked to the COVID deaths: shortness of breath, loss of smell, and loss of taste. This is interesting, since the queries related to the most recognizable COVID symptoms tend to have smaller p-values than the others. This confirms the conclusions of the correlation section. These keywords depend entirely on the pandemic situation.
For the social category, the results are unexpected. The p-values tend to be smaller for the requests related to blablacar, Uber, and the flight comparator, and these requests are negatively correlated with the evolution of deaths. However, none of the requests has a significant effect on the death data, which is quite surprising: we would have expected a stronger statistical relationship.
Finally, the how-to indicators are quite interesting. The most significant requests are those about how to make a homemade mask, for example, or how to calculate one's BMI (IMC in French). These requests are closely tied to the COVID situation, and especially to the lockdown.
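The significance test described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code used for the figures: the model is written as deaths[t] = c + phi * deaths[t-1] + beta * keyword[t-1] (using the previous step's keyword value is an assumption), it is fitted by ordinary least squares with NumPy, and the p-values come from a large-sample normal approximation rather than an exact t-distribution. The function name and the synthetic data are hypothetical.

```python
import math

import numpy as np


def fit_ar1_with_keyword(deaths, keyword):
    """OLS fit of deaths[t] = c + phi * deaths[t-1] + beta * keyword[t-1].

    Returns the coefficients (c, phi, beta) and their two-sided p-values.
    The p-values use a normal approximation, which is an assumption of
    this sketch and is only reasonable for long series.
    """
    deaths = np.asarray(deaths, dtype=float)
    keyword = np.asarray(keyword, dtype=float)
    y = deaths[1:]
    X = np.column_stack([np.ones(len(y)), deaths[:-1], keyword[:-1]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    dof = len(y) - X.shape[1]
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    z_scores = coef / std_err
    p_values = np.array([math.erfc(abs(z) / math.sqrt(2)) for z in z_scores])
    return coef, p_values


# Synthetic data (NOT real data): one keyword drives the deaths,
# the other is pure noise.
rng = np.random.default_rng(0)
n = 200
related = rng.normal(size=n)
unrelated = rng.normal(size=n)
deaths = np.zeros(n)
for t in range(1, n):
    deaths[t] = 0.5 * deaths[t - 1] + 2.0 * related[t - 1] + rng.normal()

coef_related, p_related = fit_ar1_with_keyword(deaths, related)
coef_unrelated, p_unrelated = fit_ar1_with_keyword(deaths, unrelated)
print("related keyword p-value:  ", p_related[2])
print("unrelated keyword p-value:", p_unrelated[2])
```

The last entry of `p_values` is the one of interest here: it tests whether the keyword coefficient differs from zero once the previous value of the deaths is already in the model.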
Forecasting
Can we forecast the COVID-19 deaths using Google Trends?
Let's explore one-step-ahead forecasting using an AR(1) model. Predictions are made on the second wave, each using only the data available before the date being predicted. The goal of this part is to explore whether Google Trends data improves the prediction of the second wave of COVID-19. The mean absolute error (MAE) is computed for both the base model (without Google Trends data) and the trend model (with one request added), and the two are compared. A trend model is fitted for each request, so that we can observe the effect of each request on the MAE. The resulting improvement (or deterioration) of the MAE is then plotted and discussed.
The highest MAE improvement appears when we include loss of smell in the model, and an improvement is still visible for all the symptoms specific to COVID-19. This result is interesting: by including the requests related to the most distinctive COVID symptoms, we improve the forecasting model.
For the social category, we expected this kind of result: the requests that help improve the model are the queries that behave the same way in the two waves. For example, queries related to the flight comparator decrease because of the lockdown, and this effect is present in both waves. But if we take the plato query, a game that became really popular in the first wave of COVID-19 and totally disappeared in the second one, we get a large deterioration of the MAE. This could mean that the social phenomena in the two waves are different. However, we would have expected a larger effect for the "top cafe" and "top restaurant" keywords. The smaller observed effect could be explained by the fact that the model is still a simple one and does not learn to "anticipate", whereas these two keywords should act as indicators before the wave arrives.
Here we observe that most of the features give a negative improvement, that is, they worsen the MAE. One interpretation is that these features help fit the first wave but do not generalize, so when we try to predict the second wave we get a worse result than with the base model. This is called overfitting.
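The one-step-ahead comparison described in this section can be sketched as below. It is a hedged illustration, not the original pipeline: the models are refitted by least squares on an expanding window before each prediction, the keyword enters with a one-step delay, and the function names, the split index `start` and the synthetic series are all assumptions.

```python
import numpy as np


def one_step_ahead(deaths, start, keyword=None):
    """Predict deaths[t] for every t >= start, using only data before t.

    Base model:  deaths[t] = c + phi * deaths[t-1]
    Trend model: adds beta * keyword[t-1] when a keyword series is given.
    """
    deaths = np.asarray(deaths, dtype=float)
    if keyword is not None:
        keyword = np.asarray(keyword, dtype=float)
    predictions = []
    for t in range(start, len(deaths)):
        # Training pairs: targets deaths[1..t-1], regressors from steps 0..t-2.
        columns = [np.ones(t - 1), deaths[: t - 1]]
        latest = [1.0, deaths[t - 1]]
        if keyword is not None:
            columns.append(keyword[: t - 1])
            latest.append(keyword[t - 1])
        X = np.column_stack(columns)
        coef, *_ = np.linalg.lstsq(X, deaths[1:t], rcond=None)
        predictions.append(float(np.dot(latest, coef)))
    return np.array(predictions)


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))


# Synthetic data (NOT real data): the keyword carries information
# about the next value of the deaths.
rng = np.random.default_rng(0)
n, start = 200, 100
keyword = rng.normal(size=n)
deaths = np.zeros(n)
for t in range(1, n):
    deaths[t] = 0.5 * deaths[t - 1] + 2.0 * keyword[t - 1] + rng.normal()

base_pred = one_step_ahead(deaths, start)
trend_pred = one_step_ahead(deaths, start, keyword)
mae_base = mae(deaths[start:], base_pred)
mae_trend = mae(deaths[start:], trend_pred)
improvement = mae_base - mae_trend  # positive means the keyword helps
print(round(mae_base, 3), round(mae_trend, 3), round(improvement, 3))
```

A positive `improvement` means the keyword lowers the MAE. A negative value corresponds to the deterioration discussed above, where a keyword that helped on earlier data no longer matches the period being predicted.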
The symptom keywords already reveal some insights. We clearly see many requests for all the keywords in March/April 2020 (the first epidemic wave in France). What else do you see?