
Can we predict cases of dengue with climate variables?

Recently, I discovered a new competition website that is not Kaggle! Its name is DrivenData.

DrivenData offers competitions in many fields, such as health (oh yes!), ecology, society… all with a common goal: tackling the world’s biggest social challenges.

I decided to join my first competition, called ‘DengAI: Predicting Disease Spread’. In this case, the user receives a set of weather data (temperature, precipitation, vegetation) for two cities, San Juan (Puerto Rico) and Iquitos (Peru), together with the total cases of dengue by year and week of year.

The goal of the competition is to develop a prediction model able to anticipate the cases of dengue in each city from the climate variables mentioned above.

A few days ago, the DrivenData blog published a post describing a quick approach to this dataset. It was written in Python (Arghhh….), so I decided to “translate” it to R.

The code below is divided into three parts:

1. Cleaning tasks (transforming NA values, removing columns…) and exploratory analyses.

2. A function that wraps every step of the data cleaning.

3. Model development, prediction, and comparison of predicted vs. actual total cases.

—–o—–

1. Cleaning tasks (transforming NA values, removing columns…) and exploratory analyses.

Figure: Distribution of total_cases in Iquitos

Figure: Distribution of total_cases in San Juan

Figure: Line plot of the ndvi_ne variable, including NA values

Figure: Correlation plot of variables – San Juan

Figure: Correlation plot of variables – Iquitos

Figure: Correlation plot (bar plot) of variables – Iquitos

Figure: Correlation plot (bar plot) of variables – San Juan
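The correlation plots above boil down to one number per climate variable: its Pearson correlation with total_cases. As a rough illustration only (plain Python rather than the R code behind this post, with made-up values and a hypothetical `humidity` column), it could be computed like this, skipping weeks where a value is missing:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Pairs where either value is missing (None) are skipped.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

# Made-up weekly values, NOT taken from the competition data.
humidity = [14.0, 15.5, None, 17.0, 18.2, 16.1]  # hypothetical climate column
total_cases = [4, 6, 5, 9, 12, 7]

r = pearson(humidity, total_cases)
print(round(r, 3))
```

Running this once per climate column, separately for each city, would give the values shown in a bar plot like the ones above.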

 

2. Cleaning function step-by-step. 
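A minimal sketch of what such a cleaning function might look like, written in plain Python for illustration rather than in R. It assumes that NA values are filled with the last observed value (forward fill) and that `week_start_date` is a column to drop; both are assumptions, not necessarily the choices made in the original code.

```python
def clean_rows(rows, drop_columns=()):
    """Return cleaned copies of `rows` (a list of dicts, one per week).

    Steps: drop the listed columns, then replace each missing value (None)
    with the last value seen in the same column (forward fill).
    """
    last_seen = {}
    cleaned = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key in drop_columns:
                continue
            if value is None:
                # Stays None if no earlier value exists for this column.
                value = last_seen.get(key)
            else:
                last_seen[key] = value
            new_row[key] = value
        cleaned.append(new_row)
    return cleaned

# Made-up rows for a single city; the values are illustrative only.
rows = [
    {"week_start_date": "w1", "ndvi_ne": 0.12, "total_cases": 4},
    {"week_start_date": "w2", "ndvi_ne": None, "total_cases": 5},
    {"week_start_date": "w3", "ndvi_ne": 0.20, "total_cases": 3},
]
cleaned = clean_rows(rows, drop_columns=("week_start_date",))
print(cleaned[1])
```

Because the fill carries values forward in row order, the function should be applied to each city separately so that values from San Juan never leak into Iquitos.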


3. Model development, prediction, and comparison of predicted vs. actual total cases.

 

Figure: Line plot of predicted and actual total_cases – Iquitos

Figure: Line plot of predicted and actual total_cases – San Juan

As you can see, the total cases predicted by the model differ from the actual cases in interesting ways, such as a lack of synchronicity and undetected peaks of cases in specific intervals. Over the next weeks I will keep learning about this interesting disease to better fit my competition model, and I will share the results here. Stay tuned!
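The gap between predicted and actual cases can be put into numbers. The sketch below (plain Python, made-up values) assumes mean absolute error as the score and adds a hypothetical helper, `best_lag`, that checks whether shifting the predictions by a few weeks lowers the error, which is one simple way to quantify the lack of synchronicity.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted counts."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def best_lag(actual, predicted, max_lag=4):
    """Return the shift (in weeks) of the predictions with the lowest MAE.

    A positive lag k compares actual[t] with predicted[t + k], so a result
    of k > 0 means the predicted curve runs about k weeks late.
    """
    if len(actual) != len(predicted):
        raise ValueError("inputs must be of equal length")
    n = len(actual)
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, p = actual[:n - lag], predicted[lag:]
        else:
            a, p = actual[-lag:], predicted[:n + lag]
        if a:  # skip shifts that leave no overlapping weeks
            scores[lag] = mean_absolute_error(a, p)
    return min(scores, key=scores.get)

# Made-up weekly counts: the prediction has the right shape but is one week late.
actual = [2, 3, 10, 25, 12, 4, 3, 2]
predicted = [2, 2, 3, 10, 25, 12, 4, 3]

print(mean_absolute_error(actual, predicted))  # 5.75
print(best_lag(actual, predicted))             # 1
```

Larger shifts are scored on fewer overlapping weeks, so on a short series the result is only a rough hint.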
