Can we predict cases of dengue with climate variables?

Recently, I discovered a data science competition website that is not Kaggle! Its name is DrivenData.

DrivenData hosts competitions in several fields, such as health (oh yes!), ecology and society, with one common element: tackling the world’s biggest social challenges.

I decided to join my first competition, called ‘DengAI: Predicting Disease Spread’. In this competition, participants receive weather data (temperature, precipitation, vegetation) for two cities, San Juan (Puerto Rico) and Iquitos (Peru), together with the total cases of dengue by year and week of year.

The goal of the competition is to develop a model that predicts the number of dengue cases in each city from the climate variables mentioned above.

A few days ago, the DrivenData blog published a post describing a quick first approach to this dataset. It was written in Python (Arghhh….), so I decided to “translate” it into R.

The code below is divided into three main parts:

1. Cleaning tasks (handling NA values, removing columns…) and exploratory analysis.

2. A function that wraps every step of the data cleaning.

3. Model development, prediction, and comparison of predicted vs. actual total cases.

—–o—–

1. Cleaning tasks (handling NA values, removing columns…) and exploratory analysis.

Distribution of total_cases in Iquitos
Distribution of total_cases in San Juan

 

Line plot of the ndvi_ne variable, including NA values
Correlation plot of variables – San Juan

 

Correlation plot of variables – Iquitos

 

Correlation plot (Bar plot) with variables – Iquitos

 

Correlation plot (Bar plot) with variables – San Juan

 

2. Cleaning function step-by-step. 
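
As a rough illustration of what such a cleaning function could look like, here is a minimal sketch in base R. The function name clean_dengue, the forward-fill strategy for NA values and the toy column week_start_date are assumptions made for this example; only ndvi_ne and total_cases are names taken from the plots above.

```r
# Hypothetical cleaning helper: the steps and names are assumptions,
# not the exact function used for the competition.
clean_dengue <- function(df, drop_cols = character(0)) {
  # Step 1: drop columns that will not be used as predictors
  df <- df[, setdiff(names(df), drop_cols), drop = FALSE]

  # Step 2: replace each NA in a numeric column with the last observed value
  fill_forward <- function(x) {
    for (i in seq_along(x)[-1]) {
      if (is.na(x[i])) x[i] <- x[i - 1]
    }
    x
  }
  numeric_cols <- vapply(df, is.numeric, logical(1))
  df[numeric_cols] <- lapply(df[numeric_cols], fill_forward)
  df
}

# Toy data with one missing ndvi_ne value (values are made up)
toy <- data.frame(
  week_start_date = c("1990-04-30", "1990-05-07", "1990-05-14"),
  ndvi_ne = c(0.12, NA, 0.03),
  total_cases = c(4, 5, 4)
)
cleaned <- clean_dengue(toy, drop_cols = "week_start_date")
cleaned
```

Forward filling suits weekly series where neighbouring weeks tend to be similar, but a leading NA stays NA, so it may need separate handling.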


3. Model development, prediction, and comparison of predicted vs. actual total cases.
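
To give an idea of the workflow, here is a small sketch that fits a count model and compares predicted and actual cases. Everything in it is an assumption for illustration: the data are synthetic, the predictors temp and humidity are made-up names, and a Poisson GLM from base R stands in for whatever model is actually used in the competition.

```r
set.seed(1)

# Synthetic weekly data (made up): two climate predictors and a case count
n <- 100
toy <- data.frame(
  temp = runif(n, 20, 30),
  humidity = runif(n, 60, 95)
)
toy$total_cases <- rpois(
  n,
  lambda = exp(-3 + 0.15 * toy$temp + 0.01 * toy$humidity)
)

# Chronological split: first 80 weeks to fit, last 20 weeks to evaluate
train <- toy[1:80, ]
holdout <- toy[81:100, ]

# Poisson regression for counts (an assumption; other count models exist)
fit <- glm(total_cases ~ temp + humidity, data = train, family = poisson())

# Predictions on the response scale, i.e. expected number of cases
pred <- predict(fit, newdata = holdout, type = "response")

# Mean absolute error between predicted and actual cases
mae <- mean(abs(pred - holdout$total_cases))
mae
```

Splitting by time rather than at random keeps the evaluation closer to the real task, where future weeks are predicted from past ones.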

 

Line plot of predicted and actual total_cases – Iquitos

 

Line plot of predicted and actual total_cases – San Juan

As you can see, the total cases predicted by the model differ from the actual cases in interesting ways, such as a lack of synchronicity and peaks of cases that go undetected in specific intervals. Over the next weeks I will keep learning about this interesting disease to improve my competition model, and I will share the results here. Stay tuned!

Analyzing drug consumption in Europe

The European Monitoring Centre for Drugs and Drug Addiction (EMCDDA) is a public organization that provides an overview of European drug problems with relevant data, such as the consumption of a given type of drug (cannabis, ecstasy, cocaine…), price, purity… In addition, this information is broken down by relevant variables such as gender, age or country. We can access this knowledge through the website, but there is one inconvenience: all the information is provided as numerous tables that can be downloaded in .xlsx format, so merging multiple tables with the data of interest is slow and tedious. Moreover, there is no exploratory or visualization tool to help the scientific community understand this valuable information, which remains hidden in hundreds of separate tables.

So, I decided to develop an application that solves these problems and makes the information more accessible to the scientific community, combining two tools: R and Shiny.

The first step is to gather all the information provided in .xlsx format. I downloaded every table from the section “Prevalence of drug use” and combined them into a single dataframe. In total we have 504 columns (yes, a lot), where every column represents the consumption of a certain type of drug under a specific condition. Every column had to be identified by a descriptive name, so I devised a naming code built from short strings, whose legend you can see in the next image:

For example, the column with the following information:


(1). Cocaine consumption + (2). Last year + (3). 35-54 yr. + (4). Male

(Column name) “coc_year_35_ma”
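
As a small sketch of how such names can be built and read back in R: the helper build_col_name is hypothetical, and the only codes used are the ones from the example above (the full legend is in the image).

```r
# Hypothetical helper: joins the four codes with underscores
build_col_name <- function(drug, period, age, sex) {
  paste(drug, period, age, sex, sep = "_")
}

col_name <- build_col_name("coc", "year", "35", "ma")
col_name

# Splitting the name recovers the four codes
parts <- strsplit(col_name, "_", fixed = TRUE)[[1]]
parts
```

Because every name follows the same four-part pattern, the columns for one drug or one age group can be selected with a simple pattern match on the column names.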

Below, I show you how the application handles that information:

Once the dataframe was created, with every column named according to this code, it was time to play with the data. I decided to use different visualization tools that display the same data, so the user can look at the same information from different perspectives:


In my opinion, graphics should be as simple as possible, because simplicity helps the user understand the displayed information more easily. For this reason, I would like to emphasize two details:

1. As you can see, the color scale used in the map and in the bar plot of consumption is exactly the same. Keeping the same scale (1) across different graphics means the user has to think less about what the scale represents.

2. At first, I did not want to create an interactive graphic, because interactivity often adds unnecessary complexity, but in this case it was needed. The interactive box plot serves two purposes. First, the user can see a label next to every point with the country and the percentage of consumption. Second, most points fall within a narrow interval of values, so the user can zoom in on any interval of interest and appreciate the differences.

Finally, two features were added to help the user save the data produced by the application:

1. The user can download the selected data in .csv format, so it can be imported into Excel to create custom graphics.

2. An R Markdown report in .html format with all the graphics generated by the app.

We can draw some interpretations from the visualization of the data:

– Cocaine consumption across countries. If we compare the routes by which drug traffickers bring narcotics into European territory (as you can see in the picture below) (2) with the consumption by country, we can see a negative linear relationship between consumption and the distance the drug has to travel from the countries of arrival (Spain, Netherlands, Belgium, Italy and France) (2).

In addition, we can quantify how cocaine consumption relates to the travel distance. We can calculate the sum of the distances between the three largest cities of every country and the five main countries of arrival, each represented by its capital.

Yes, but how could we do this? Of course, with R:

Residuals:
     Min       1Q   Median       3Q      Max
-0.77725 -0.36353 -0.09222  0.25582  1.42392

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.625e+00  1.787e-01   9.091  1.8e-13 ***
distance    -1.054e-07  1.966e-08  -5.363  1.0e-06 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5408 on 70 degrees of freedom
Multiple R-squared: 0.2912, Adjusted R-squared: 0.2811
F-statistic: 28.76 on 1 and 70 DF, p-value: 9.997e-07
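
The output above has the shape of a summary() of a linear model with distance as the only predictor. As a hedged sketch of how such an analysis could be set up: the haversine_km helper, the response name consumption and the toy numbers are all assumptions, and only the Madrid and Paris coordinates are real values.

```r
# Great-circle distance in km between two points given in decimal degrees
# (haversine formula, Earth radius taken as 6371 km)
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(a))
}

# Example: Madrid to Paris, roughly 1050 km
madrid_paris <- haversine_km(40.4168, -3.7038, 48.8566, 2.3522)
madrid_paris

# Toy data (made up): summed distance per observation and a consumption value
toy <- data.frame(
  distance = c(1, 2, 3, 4, 5) * 1e6,
  consumption = c(2.0, 1.7, 1.5, 1.1, 0.9)
)

# Simple linear regression of consumption on distance
fit <- lm(consumption ~ distance, data = toy)
summary(fit)
```

In the real output, the negative distance coefficient and the R-squared of about 0.29 suggest that distance explains part, but far from all, of the variation in consumption.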

We can also create a map of connections that links the three largest cities of every country with the five arrival countries (mentioned above), represented by their capitals:

 

– Drug consumption in the female population. It is always lower than in the male population. Looking for information about this fact, I found an interesting article in the British newspaper “The Telegraph” that discussed the cultural factor:


“Historically, women have had less freedom to use drugs,” she said. “There are different expectations of them, and they probably have different expectations of themselves as well. You can see that it’s a cultural thing from the breakdown by ethnicity in the government report. With the white British demographic, twice as many men have taken drugs as women, whereas in some Asian communities it’s about four times as many” – Dr Jane Marshall, a consultant psychiatrist based at the NHS South London.

– Low values for Turkey. In every condition analyzed, Turkey has the lowest consumption value. Clearly, cultural differences are a crucial element in drug consumption:

Furthermore, cultural differences also affect the differences between genders:

Country     Female population (15-64 yr.) (%)   Male population (15-64 yr.) (%)   Male/female ratio (%)
Hungary     72                                  86.7                              120.41
Ireland     83.2                                87.5                              105.16
Italy       74.4                                90.2                              121.23
Latvia      83.1                                87.5                              105.29
Lithuania   78.2                                85.9                              109.84
Austria     71                                  78.4                              110.42
Norway      85.6                                86.3                              100.81
Portugal    49.3                                73.6                              149.29
Romania     54.2                                76.1                              140.40
Slovakia    72.69                               82.29                             113.20
Spain       73.4                                83.2                              113.35
Turkey      6.8                                 21.5                              316.17
Bulgaria    61.8                                86.5                              139.96
Croatia     62.5                                81.1                              129.76
Estonia     81                                  89.5                              110.49
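
The last column of the table appears to be the male value divided by the female value, times 100, cut to two decimals. A quick check in R with two rows from the table (the data frame and column names are made up for the example):

```r
# Two rows copied from the table above
toy <- data.frame(
  country = c("Hungary", "Turkey"),
  female = c(72, 6.8),
  male = c(86.7, 21.5)
)

# Male value as a percentage of the female value
toy$ratio_pct <- toy$male / toy$female * 100

# Truncate (not round) to two decimals, which matches the table
ratio_trunc <- floor(toy$ratio_pct * 100) / 100
ratio_trunc
# → 120.41 316.17
```

Read this way, a value of 316.17 for Turkey means male consumption is about 3.2 times the female value, not 316% higher.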

These analyses are just a few examples of what we can do with this data. I invite you to use the app and share your ideas in the comments section.

 

ACCESS DRUGSPLOT

 

 

 

(1). Maybe you wonder why I did not use a sequential color scale. In fact, that was my first option, but the problem is “Turkey”. Every country has a more or less similar value, but Turkey, for every drug, age, gender… it doesn’t matter… always has an amazingly low value (very healthy people, I guess…). So, with a sequential scale, the contrast between Turkey and the rest of the countries was good, but the contrast among the remaining countries was very low. I therefore decided to use a scale with more than one color, such as “Spectral”, a diverging scale, where the contrast, with this data, is higher than with a sequential one.

(2). Source: http://www.emcdda.europa.eu/topics/pods/cocaine-trafficking-to-europe.

Difference between unique() and duplicated()

When we work with data, we often run into an obstacle: repeated values. These values are not a critical problem as long as we are able to identify them. Once we have the list of repeated values, it is very easy to discard, remove or simply extract them.

We are going to look at two functions in R that let us identify repeated values: unique() and duplicated(). As we will see below, we can use these functions with different types of data, such as vectors, matrices or dataframes.

  • As we can see, unique() returns the distinct values themselves, with repeats removed.
  • In contrast, duplicated() returns a logical vector that marks each element repeating an earlier one.
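
A minimal vector example of these two behaviours (the vector x is made up for illustration):

```r
x <- c(1, 2, 2, 3, 3, 3)

# unique() keeps the first occurrence of every value
u <- unique(x)
u
# → 1 2 3

# duplicated() flags every element that already appeared earlier
d <- duplicated(x)
d
# → FALSE FALSE  TRUE FALSE  TRUE  TRUE

# Negating the flags selects the same elements as unique()
x[!d]
# → 1 2 3
```

The first occurrence of a value is never flagged, which is why x[!duplicated(x)] and unique(x) agree.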

We can also use these functions with matrices:
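
For a matrix, both functions compare whole rows by default. A small sketch with a made-up matrix m:

```r
# Rows 1 and 3 are identical
m <- rbind(c(1, 2),
           c(3, 4),
           c(1, 2))

# unique() drops the repeated row
um <- unique(m)
um

# duplicated() returns one logical value per row
dm <- duplicated(m)
dm
# → FALSE FALSE  TRUE
```

Both functions accept a MARGIN argument, so MARGIN = 2 would compare columns instead of rows.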

Now we will identify unique and duplicated rows using a very common dataframe called iris. We will also select the rows that are not repeated:

Finally, we can see that we obtain the same result with unique(iris) and iris[!duplicated(iris),]
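
A short check of this equivalence on the built-in iris dataset (the variable names are chosen for this example):

```r
# Row numbers of iris that repeat an earlier row
dup_rows <- which(duplicated(iris))
dup_rows
# → 143

# Two ways of keeping only the non-repeated rows
iris_unique <- unique(iris)
iris_not_dup <- iris[!duplicated(iris), ]

nrow(iris_unique)
# → 149

# Both approaches return exactly the same dataframe
identical(iris_unique, iris_not_dup)
# → TRUE
```

Note that unique() takes the dataframe itself as its argument. It is not used as an index inside the square brackets the way !duplicated(iris) is.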

How to create a fast and easy heatmap with ggplot2

Heatmaps are a data visualization tool widely used with biological data. The idea is to represent a matrix of values as colors, usually arranged along a gradient. We can find a large number of these graphics in scientific articles on gene expression, for example from microarray or RNA-seq experiments.

In the next example, we are going to represent a dataframe of gene expression values for 20 genes and 20 patients.

Once we have our dataframe (df_heatmap), we can visualize the values with the package ggplot2.
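
A minimal sketch of both steps, assuming random values in place of real expression data. The column names gene, patient and expression and the color choices are assumptions; df_heatmap is the name used above. The plotting part only runs if ggplot2 is installed.

```r
set.seed(42)

# Long-format dataframe: one row per gene/patient pair (20 x 20 = 400 rows)
df_heatmap <- expand.grid(
  gene = paste0("gene_", 1:20),
  patient = paste0("patient_", 1:20)
)

# Random numbers standing in for expression values
df_heatmap$expression <- rnorm(nrow(df_heatmap))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # geom_tile() draws one colored cell per row of the dataframe
  p <- ggplot(df_heatmap, aes(x = patient, y = gene, fill = expression)) +
    geom_tile() +
    scale_fill_gradient(low = "white", high = "steelblue") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  # Typing p at the console (or calling print(p)) draws the heatmap
}
```

ggplot2 expects the data in this long format, so a wide gene-by-patient matrix has to be reshaped first.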


 

Difference between paste() and paste0()

paste() is probably one of the most used functions in R. Its purpose is to concatenate a series of strings.

The arguments of the function are:

... = The strings to concatenate.

sep = The string that separates the terms. It must be given as a character string.

collapse = The string that separates the results when they are joined into a single string. It must be given as a character string and it is optional.

We can see an example where both arguments work together:

As we can see in the fourth example, if we specify a value for the argument collapse, we obtain a single string instead of five as in the previous example.

The difference between paste() and paste0() is the default value of the argument sep: “ ” for paste() and “” for paste0().
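
A compact illustration of sep, collapse and the two defaults (the strings are made up for the example):

```r
# Default sep = " ": five strings, each with a space inside
a <- paste("gene", 1:5)
a
# → "gene 1" "gene 2" "gene 3" "gene 4" "gene 5"

# Custom sep: still five strings
b <- paste("gene", 1:5, sep = "_")
b
# → "gene_1" "gene_2" "gene_3" "gene_4" "gene_5"

# sep and collapse together: the five results are joined into one string
cc <- paste("gene", 1:5, sep = "_", collapse = ", ")
cc
# → "gene_1, gene_2, gene_3, gene_4, gene_5"

# paste0() behaves like paste() with sep = ""
d <- paste0("gene", 1:5)
identical(d, paste("gene", 1:5, sep = ""))
# → TRUE
```

In short, sep works inside each result, while collapse works between results.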

In conclusion, paste0() is more convenient than paste() when our goal is to concatenate strings without spaces, because we don’t have to specify the argument sep.