"US could see a very deadly December with tens of thousands of coronavirus deaths to come, computer model predicts", "The COVID 19 CRISIS: 300,000 Deaths By December?", "A model predicts COVID 19 death toll will double"… Headlines like these have appeared in the international press since the start of the COVID-19 health crisis. Predictions of the number of cases or deaths keep multiplying, and they often disagree with one another. Many of us have grown skeptical of these predictions and doubt the numbers.

Throughout this data story, we propose a method that could improve those predictions by including the information available on Google Trends. We focus this study on 3 main groups of search queries: searches about the symptoms of COVID-19, searches about social life (outings to restaurants, for example), and "how to" searches about activities and tutorials (how to make a mask, for instance). We compare the contribution of each query in order to find out which ones improve the prediction. Join us as we dive into search behavior during the COVID-19 pandemic and find out what the data reveals about changes in our lifestyle.

To analyze COVID-19 deaths together with Google Trends, we need to make several assumptions. First, an epidemic spreads within a population, so epidemic peaks occur at different times around the world. Moreover, in 2020 each country took its own political measures to counter the health crisis. Focusing on a single country is therefore important in order to have a homogeneous population subject to the same restrictions at the same time. We chose to focus our analysis on one country: France. In addition, we focus here only on COVID-19 mortality and not on the cases. Indeed, the number of positive cases depends on the testing capacity, so it is less representative of the different epidemic waves than the number of deaths.

Let’s first focus on the COVID-19 deaths data. We downloaded the data from the EU open data portal and then preprocessed it.
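The preprocessing steps are not detailed in this story. As a rough illustration only, a minimal version could look like the sketch below. The column names (`date`, `country`, `deaths`), the date format, the sample values, and the choice to clamp negative corrections to zero are all assumptions made for this example, not the actual layout of the downloaded file.

```python
import csv
import io
from datetime import date, datetime

# Tiny made-up sample standing in for the downloaded file.
# Column names and values are hypothetical, for illustration only.
RAW_CSV = """date,country,deaths
03/04/2020,France,10
01/04/2020,France,5
02/04/2020,France,-2
01/04/2020,Belgium,7
02/04/2020,Belgium,3
"""


def load_daily_deaths(csv_text, country):
    """Return (date, deaths) pairs for one country, sorted by date."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        if record["country"] != country:
            continue  # keep a single country, as argued above
        day = datetime.strptime(record["date"], "%d/%m/%Y").date()
        # Assumption: negative counts are reporting corrections; clamp to zero.
        deaths = max(int(record["deaths"]), 0)
        rows.append((day, deaths))
    return sorted(rows)


france = load_daily_deaths(RAW_CSV, "France")
for day, deaths in france:
    print(day.isoformat(), deaths)
```

Sorting by date matters because every later step (cross-correlation, autoregression, forecasting) treats the deaths as an ordered time series.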

Now that we have the COVID-19 deaths, let’s extract the Google Trends data: the keywords, or Google requests. Which keywords do we choose? We decided to select several categories of Google requests. The idea of a category is to group similar keywords so that they can be compared with one another. For instance, we want to compare different symptoms and their relevance to the COVID-19 deaths. We present the categories and their associated keywords in the following table.

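| Symptoms | Social | How to? |
|---|---|---|
| fievre (fever) | blablacar | savoir si on a le coronavirus (know whether you have coronavirus) |
| toux seche (dry cough) | taxi | mettre un masque (put on a mask) |
| fatigue (tiredness) | house party | calculer son IMC (calculate your BMI) |
| courbatures (body aches) | Uber | faire du pain (make bread) |
| maux de gorge (sore throat) | top cafe | calculer distance 100 km (calculate a 100 km distance) |
| perte du gout (loss of taste) | top bar | faire un masque sans machine (make a mask without a sewing machine) |
| perte odorat (loss of smell) | plato | se transmet le coronavirus (coronavirus is transmitted) |
| essoufflement (shortness of breath) | top restaurant | fabriquer un gel hydroalcoolique (make hydroalcoholic gel) |
| diarrhee (diarrhea) | Comparateur vol (flight comparator) | se transmet le coronavirus (coronavirus is transmitted) |
| maux de tete (headache) | Couvre-feu (curfew) | mettre un masque chirurgical (put on a surgical mask) |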

How did we select the different categories? Here are some insights:

Symptoms: Google keywords related to COVID-19 symptoms such as cough, fever, loss of taste, etc. These keywords reflect the current concerns of Google users about COVID-19 symptoms and can give a more precise idea of the evolution of the virus.

Social: Google keywords related to social activities, such as top restaurant, top bar, etc. These keywords give some details on the various social "movements". The idea is to look for a link between these indicators and the evolution of COVID-19. This is particularly interesting given the many restrictions on social life implemented by governments.

How to: Google keywords from the top 10 "how to" questions of 2020, covering different fields. With these keywords, we want to see whether there is a link between, for example, the increase in questions about how to make a mask and the evolution of the virus.

Now that we have our Google requests dataset, let’s plot it. You can show or hide each keyword by clicking on its label in the legend.

A strong first wave

The symptom keywords already reveal some insights. We clearly see a surge of requests for all the keywords in March/April 2020 (the first epidemic wave in France). What else do you see?

Cross-correlation Analysis

Are COVID deaths and Google Trends related?


Now that we have the COVID-19 deaths data and the Google Trends data for our different categories, we can compare them. In this section, we perform a Spearman cross-correlation analysis between the COVID-19 deaths and the Google Trends keywords of each category. Let’s see what it looks like.
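To make the idea concrete, here is a minimal, dependency-free sketch of a Spearman cross-correlation: the Spearman coefficient is the Pearson correlation of the ranks, and the cross-correlation repeats that computation while shifting one series against the other. The function names, the lag convention, and the toy series are assumptions made for this illustration; they are not the data or the exact code behind the figures.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def cross_correlation(trend, deaths, max_lag):
    """Spearman correlation between trend[t] and deaths[t + lag] for each lag."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = trend[:len(trend) - lag], deaths[lag:]
        else:
            x, y = trend[-lag:], deaths[:len(deaths) + lag]
        result[lag] = spearman(x, y)
    return result


# Toy series (made up): deaths follow the search trend two steps later.
trend = [0, 1, 4, 9, 7, 3, 2, 1, 0, 0, 0, 0]
deaths = [0, 0, 0, 10, 40, 90, 70, 30, 20, 10, 0, 0]

correlations = cross_correlation(trend, deaths, max_lag=4)
best_lag = max(correlations, key=correlations.get)
print("best lag:", best_lag)
```

With this convention, a positive best lag means that the searches move before the deaths. Because Spearman works on ranks, it only assumes a monotonic relationship, which suits a search index and a death count that live on very different scales.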

What can we see?

The most specific symptoms of COVID-19 are the most correlated with the death data. This is an expected result: the more the virus spreads, the more people search for information about its symptoms.

What can we see?

In general, social keywords tend to have a higher correlation than the 2 other categories. This is particularly interesting because it suggests a strong relationship between the evolution of the virus and the evolution of social life.

What can we see?

The same conclusions can be drawn for the how-to category: its queries follow the evolution of the spread of the virus.

Statistical Significance


Now that the features have been selected and the first step of the study has been carried out, we propose a second approach to look at the relationship differently. The idea is to fit a linear autoregressive model of order 1, AR(1), on the death data. The model takes the Google Trends index of each keyword as an external (exogenous) variable.

Let’s fit our model on the data and extract the relevant features. The goal is to find out which features are significant in the prediction model. The criterion is a statistical test on each feature's coefficient: a p-value below the significance level of 0.05 means that there is a significant relationship between the feature and the death data, while a larger p-value means that no significant relationship was detected. This procedure is applied to the 3 categories, and the results for the most interesting requests per category are shown in the following visualizations.
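The sketch below illustrates this kind of test on synthetic data. It fits y_t = a + b·y_(t-1) + c·x_(t-1) by ordinary least squares and reports a p-value for each coefficient. Several things here are assumptions made for the illustration: the keyword index enters with a one-step lag, the p-value uses a normal approximation rather than the exact Student t distribution, and the series and helper names (`solve`, `ols`, `fit_keyword`) are invented for this example.

```python
import math
import random


def solve(a, b):
    """Solve the square linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares fit; returns coefficients and two-sided p-values."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    residuals = [t - sum(c * v for c, v in zip(beta, r)) for r, t in zip(rows, y)]
    sigma2 = sum(e * e for e in residuals) / (len(y) - k)
    p_values = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        variance = sigma2 * solve(xtx, unit)[j]  # j-th diagonal of the covariance
        z = beta[j] / math.sqrt(variance)
        # Normal approximation of the t-test (reasonable for large samples).
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return beta, p_values


def fit_keyword(y, keyword):
    """AR(1) on y with the lagged keyword index as exogenous variable."""
    rows = [[1.0, y[t - 1], keyword[t - 1]] for t in range(1, len(y))]
    return ols(rows, y[1:])


# Synthetic data: one keyword drives the series, the other is pure noise.
random.seed(0)
n = 200
relevant = [random.gauss(0, 1) for _ in range(n)]
unrelated = [random.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[-1] + 2.0 * relevant[t - 1] + random.gauss(0, 1))

beta_relevant, p_rel = fit_keyword(y, relevant)
beta_unrelated, p_unr = fit_keyword(y, unrelated)
p_relevant, p_noise = p_rel[2], p_unr[2]
for name, p in [("relevant", p_relevant), ("unrelated", p_noise)]:
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: p = {p:.3g} ({verdict})")
```

The keyword that actually drives the series gets a p-value far below 0.05. The unrelated one will usually, though not always, land above the threshold, since about 5% of pure-noise features pass the test by chance.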

What can we see?

We notice that 3 requests are significantly linked to the COVID deaths: shortness of breath, loss of smell, and loss of taste. This is interesting, since the queries related to the most recognizable COVID symptoms tend to have smaller p-values than the others. This confirms the conclusions of the correlation section. These keywords are completely dependent on the pandemic situation.

What can we see?

This time, the results are unexpected. The p-values tend to be smaller for the requests related to blablacar, Uber, and flight comparators. These requests are negatively correlated with the evolution of deaths. However, none of the requests has a significant effect on the death data, which is quite surprising: we would have expected a stronger statistical relationship.

What can we see?

Finally, the how-to indicators are quite interesting. We observe that the most significant requests are, for example, how to make a mask at home or how to calculate one's BMI (IMC in French). These requests are closely tied to the COVID situation, and especially to the lockdown.

Forecasting

Can we forecast COVID-19 deaths using Google Trends?


Let's explore one-step-ahead forecasting with an AR(1) model. Each prediction for the second wave uses only the data available before the date being predicted. The goal of this part is to measure how much Google Trends data improves the prediction of the second wave of COVID-19. We compute the mean absolute error (MAE) of the base model and of the trend model, and compare the two. A separate trend model is fitted for each request, so that we can observe the effect of each request on the MAE. The resulting improvement (or deterioration) of the MAE is then plotted and discussed.
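As a hedged sketch of this procedure, the code below refits the model before each forecast using past observations only, then compares the MAE of a base AR(1) model with the MAE of the same model extended with one keyword. The one-step lag on the keyword, the expanding training window, the relative-improvement formula, and the synthetic data are assumptions made for the illustration.

```python
import random


def solve(a, b):
    """Solve the square linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, y):
    """Ordinary least squares through the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def one_step_mae(y, start, exog=None):
    """MAE of one-step-ahead forecasts of y[start:], using past data only."""

    def features(s):
        # Predictors for time s: intercept, previous value, previous keyword index.
        row = [1.0, y[s - 1]]
        if exog is not None:
            row.append(exog[s - 1])
        return row

    errors = []
    for t in range(start, len(y)):
        rows = [features(s) for s in range(1, t)]  # observations before t
        beta = fit(rows, y[1:t])
        forecast = sum(c * v for c, v in zip(beta, features(t)))
        errors.append(abs(y[t] - forecast))
    return sum(errors) / len(errors)


# Synthetic series in which the keyword index leads the target by one step.
random.seed(1)
n = 120
trend = [random.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.6 * y[-1] + 1.5 * trend[t - 1] + random.gauss(0, 0.5))

base_mae = one_step_mae(y, start=60)
trend_mae = one_step_mae(y, start=60, exog=trend)
improvement = 100 * (base_mae - trend_mae) / base_mae
print(f"base MAE {base_mae:.3f}, trend MAE {trend_mae:.3f}, "
      f"improvement {improvement:.1f}%")
```

A positive improvement means the keyword lowers the forecast error; a negative one means the model does worse with the keyword than without it.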

What can we see?

The highest MAE improvement appears when we include the loss of smell in the model, and an improvement is still visible for all the symptoms specific to COVID-19. This result is interesting: by including the requests related to the most distinctive COVID symptoms, we improve the forecasting model.

What can we see?

For this category, the result is what we expected: the requests that help improve the model are the ones that behave the same way in both waves. For example, queries for flight comparators decrease because of the lockdown, and this effect is present in both waves. In contrast, the plato query, a game that became very popular during the first wave of COVID-19 and all but disappeared in the second, causes a marked deterioration of the MAE. This could mean that the social phenomena in the two waves are different. However, we would have expected a larger effect for the "top cafe" and "top restaurant" keywords. The smaller observed effect could be explained by the simplicity of the model, which does not learn to "anticipate", whereas these two keywords should act as leading indicators before a wave arrives.

What can we see?

Here we observe that most of the features degrade the MAE (a negative improvement). A possible interpretation is that these features help fit the first wave but do not generalize: when we try to predict the second wave, we get a worse result than with the base model. This is called overfitting.

Discussion

Through this analysis, we found that the topics most related to the evolution of the number of deaths in France were the symptoms and the social activities. We then saw that some symptoms, such as the loss of smell or taste, are key features for predicting the number of COVID-19 deaths. These features led to an improvement of 15% over the base model.

Moreover, some of the social activity features can help better predict the number of COVID-19 deaths. Others, however, were really helpful for the first wave but much less so for the second. Therefore, we cannot rely on all of the social features to predict the number of deaths from COVID-19.

Finally, we have seen that including some features from Google Trends in the base model can help better predict noisy data such as COVID-19 deaths.

To go further, it would be interesting to combine the best features to get an even higher improvement. We tried this with our simple model, but the results were not as good as hoped. It would be interesting to try other models.

Authors

Made by Baudoin de Sury, Mehdi Akeddar and Jérémy Plassmann.