Prediction of Diabetes Diagnosis via Machine Learning

Updated: Nov 6, 2020



Introduction

Data science is increasingly used for diagnostics in healthcare. This article presents some machine learning techniques for predicting diabetes diagnoses, using the Python language.


Objective

Create a model that predicts whether or not a person will develop diabetes. For this, we will use historical patient data from the dataset below.

Dataset: Pima Indians Diabetes Dataset http://archive.ics.uci.edu/ml/datasets/diabetes

This dataset contains medical records of Pima Indian patients, and each record is labeled with whether or not the patient developed diabetes.


First, here is an overview of the attributes in the dataset.


Attribute information:

  1. Number of times pregnant

  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test

  3. Diastolic blood pressure (mm Hg)

  4. Triceps skin fold thickness (mm)

  5. 2-Hour serum insulin (mu U/ml)

  6. Body mass index (weight in kg/(height in m)^2)

  7. Diabetes pedigree function

  8. Age (years)

  9. Class variable (0 or 1)

Python Libraries

First, we import the libraries needed for data manipulation, data visualization, and machine learning.

As we develop our model, we will present more details on the features and applicability of the different algorithms that we will be using.

Although some popular algorithms tend to suit certain kinds of problems, good practice is to test several algorithms and choose the ones that actually achieve the best accuracy.
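As a minimal sketch of this kind of comparison, the snippet below scores three classifiers with 5-fold cross-validation using scikit-learn. The synthetic data from make_classification and the choice of these three algorithms are assumptions made for illustration; they are not the article's dataset or its final selection.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): 8 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# Candidate algorithms, chosen only for illustration.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=7),
}

# A fixed random_state makes every model see the same folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    mean_accuracy[name] = scores.mean()

best_model = max(mean_accuracy, key=mean_accuracy.get)
for name, acc in mean_accuracy.items():
    print(f"{name}: {acc:.3f}")
print("Best:", best_model)
```

Scoring every candidate on identical folds keeps the comparison fair, since differences in accuracy then come from the algorithms rather than from the way the data happened to be split.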

Another important factor in achieving the best accuracy is tuning the hyperparameters of the algorithms. We will present some examples of how to do this in Python; in a cloud environment, this step can be done automatically.
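One common way to search for hyperparameters in Python is scikit-learn's GridSearchCV. The sketch below is illustrative only: the synthetic data, the k-nearest neighbors classifier, and the grid values are all assumptions, not the settings used later in the article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption): 8 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# Hypothetical grid: every combination of these values is evaluated.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# Each combination is scored with 5-fold cross-validated accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

A grid search is exhaustive, so its cost grows with the number of combinations; for larger grids, a randomized search over the same parameter space is a frequently used alternative.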

This matters especially in healthcare, where we deal with people's lives and need to obtain the best possible accuracy while avoiding Type I errors (false positives) and Type II errors (false negatives).
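To make the two error types concrete, the sketch below counts them by hand for a small set of labels. The labels are made up for illustration (1 = diabetes, 0 = no diabetes) and do not come from the dataset.

```python
# Hypothetical true and predicted labels: 1 = diabetes, 0 = no diabetes.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)
true_negatives = sum(1 for t, p in pairs if t == 0 and p == 0)
false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type I errors
false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type II errors

accuracy = (true_positives + true_negatives) / len(pairs)
# Share of healthy patients wrongly flagged as diabetic.
type_1_rate = false_positives / (false_positives + true_negatives)
# Share of diabetic patients the model missed.
type_2_rate = false_negatives / (false_negatives + true_positives)

print("Type I errors (false positives):", false_positives)
print("Type II errors (false negatives):", false_negatives)
print(f"Accuracy: {accuracy:.2f}")
```

In a diagnostic setting the two errors rarely cost the same: a false negative leaves a patient untreated, so accuracy alone can hide the error that matters most.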

We will present each algorithm as a template so that it can be reused outside this project.

Development becomes more efficient when we work with templates that require as little customization as possible.

Many cloud providers apply this technique in their development environments. Microsoft Azure, for example, is very active in automating and standardizing templates that require little customization.

The cloud environment makes it easy to carry out the testing of various algorithms and hyperparameters in a practical and efficient way, thus reducing production time and minimizing errors.

Because this project does not use a cloud environment, we will walk through the main steps of a machine learning workflow for a classification problem in Python.

We will also list the Python version and the versions of the libraries used, to help avoid errors when configuring the environment.

Fig. 1. Library versions and Python version
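A version report like this can be produced with the standard library alone. The sketch below is one possible approach; the package list is an assumption and should be adjusted to the libraries you actually import.

```python
import platform
from importlib import metadata

# Package names are assumptions; edit this list to match your environment.
packages = ["pandas", "numpy", "matplotlib", "scikit-learn"]

versions = {"python": platform.python_version()}
for name in packages:
    try:
        versions[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        # Report a missing package instead of stopping the whole listing.
        versions[name] = "not installed"

for name, version in versions.items():
    print(f"{name}: {version}")
```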


Extracting and Loading Data

There are several things to consider when loading data for a machine learning process. For example: does your data have a header? If not, you will need to define a name for each column. Do your files contain comments? What is the column delimiter? Are any values enclosed in quotes, single or double?


Fig. 2. Loading the data
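The questions above map directly onto parameters of pandas' read_csv. The following is a minimal sketch, not necessarily the code shown in the figure: the short column names are hypothetical, and a few rows in the dataset's nine-column layout are embedded inline so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical short names for the nine attributes listed earlier.
column_names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# A few illustrative rows in the same layout (not the full dataset).
raw_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

df = pd.read_csv(
    raw_csv,
    header=None,         # the data has no header row
    names=column_names,  # so the column names are supplied here
    sep=",",             # column delimiter
    comment="#",         # ignore anything after a '#'
    quotechar='"',       # character used to quote values
)

print(df.shape)
print(df.head())
```

To load a real file, pass its path in place of raw_csv and keep the remaining arguments in line with how that file is formatted.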


Exploratory Data Analysis


Descriptive statistics

After importing the data, we apply descriptive statistics to summarize it. This gives a better understanding of the data and, in many cases, lets us identify patterns and outliers in its distribution.


Fig. 3. Top 20 lines


If the number of rows in your file is very large, the algorithm can take a long time to train. If the number of records is too small, you may not have enough data to train your model.

If you have many columns in your file, the algorithm may have performance problems due to the high dimensionality.

The best solution will depend on each case. But remember: train your model on a subset of your larger dataset and then apply the model to new data.
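As a rough sketch of this idea, the snippet below checks the dimensions of a DataFrame and then holds out part of the rows, so the model can be trained on one subset and evaluated on rows it has never seen. The toy values and the 70/30 split are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: one feature and the class label.
df = pd.DataFrame({
    "plas": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "class": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Number of rows and columns, useful for judging training time and dimensionality.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Randomly take 70% of the rows for training; the rest play the role of new data.
train = df.sample(frac=0.7, random_state=7)
test = df.drop(train.index)

print(len(train), "training rows,", len(test), "held-out rows")
```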

The data types are very important. Strings may need to be converted, and columns of integers may actually represent categorical or ordinal variables.

Below is a general overview of the data.

Fig. 4. Statistical summary and data types
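A summary of this kind can be obtained in pandas with dtypes and describe. The sketch below uses a small made-up DataFrame with hypothetical column names, and also shows one way to mark an integer column as categorical.

```python
import pandas as pd

# Hypothetical toy data with an integer, a float, and a label column.
df = pd.DataFrame({
    "preg": [6, 1, 8, 1],
    "mass": [33.6, 26.6, 23.3, 28.1],
    "class": [1, 0, 1, 0],
})

# Data type of each column.
print(df.dtypes)

# Count, mean, standard deviation, min, quartiles, and max of numeric columns.
summary = df.describe()
print(summary)

# An integer column that encodes categories can be converted explicitly.
df["class"] = df["class"].astype("category")
print(df["class"].dtype)
```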


Balancing the classes

In classification problems it may be necessary to balance the classes. Unbalanced classes (i.e., one class with far more records than the other) are common and need to be addressed during the pre-processing phase. We can see below that there is a clear disproportion between classes 0 (non-occurrence of diabetes) and 1 (occurrence of diabetes).
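One simple way to inspect the class distribution in pandas is to count the rows per class. The sketch below uses made-up labels and a hypothetical column name, so the counts it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical labels: 0 = no diabetes, 1 = diabetes.
df = pd.DataFrame({"class": [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]})

# Number of records in each class.
counts = df.groupby("class").size()
print(counts)

# The same information as proportions of the whole dataset.
proportions = df["class"].value_counts(normalize=True).sort_index()
print(proportions)
```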