Credit Analysis using Machine Learning

Updated: Jul 29, 2020





Introduction

In the business world, credit policy is one of the main strategic sales policies. Knowing how to analyze, on a statistical basis, the risks of both the credit that is approved and the credit that is refused is fundamental to keeping the business in balance.


Without data technology to support the credit policy, a company loses customers when it refuses credit to good payers and, on the other hand, suffers non-payment and late receivables when it approves credit for customers who default, which creates cash flow problems.


A poorly calibrated credit analysis forces the company to hold more working capital to cover its cash flow and also raises financial expenses through prepayments and the replacement of receivables.


Therefore, it is essential to apply data science to credit analysis in order to increase market share and sales, reduce financial expenses, and reduce the working capital needed to balance cash flow.

Summary

This experiment aims to demonstrate the process of building a classification model to predict the risk of granting credit to a bank's customers. We will use a dataset to build and train our model.


Dataset Information

The “German Credit Data” dataset will be used to build and train the model in this experiment. This dataset is based on real data generated by a researcher at the University of Hamburg, Germany.

The dataset contains 1000 observations and 20 variables representing customer data, such as current account status, credit history, current credit amount, employment, residence, and age.

In our repository below, you can find the details and scripts for all projects.

Dataset: https://github.com/lexxconsulting


Objective

The objective is to predict the risk that each customer poses to the bank at the moment a credit line is granted. The predictive model must be quite accurate, as granting credit to a customer with poor payment potential can bring a huge loss to the bank.


Lifecycle


1. Data Transformation

2. Exploratory Analysis

3. Feature Selection

4. Generating a Predictive Model

5. Optimizing the Predictive Model

6. Analyzing the Predictive Models

The figure below shows the whole process in the Microsoft tool Azure Machine Learning Studio.




1. Data Transformation

In this step we perform the following routines, based on the data dictionary that accompanies the original dataset:


• Label the columns

• Transform numerical variables into factors

• Transform numeric variables with a considerable number of unique values into factors

• Quantize the target variable


The figure below shows all the named columns and their transformed data types.
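The routines listed above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the column names, the cut points used for binning, and the sample rows are hypothetical, and the coding of the target (1 = good, 2 = bad) is assumed here rather than taken from the article.

```python
# Hypothetical column names; the real names come from the dataset's data dictionary.
COLUMN_NAMES = ["checking_status", "duration_months", "credit_amount", "age", "credit_rating"]

# Illustrative raw rows, as they might arrive from a file without a header.
raw_rows = [
    ["A11", 6, 1169, 67, 1],
    ["A12", 48, 5951, 22, 2],
    ["A14", 12, 2096, 49, 1],
]

def label_columns(rows, names):
    """Attach a column name to every value in each row."""
    return [dict(zip(names, row)) for row in rows]

def to_bin(value, cut_points, labels):
    """Return the label of the first bin whose upper bound is >= value."""
    for upper, label in zip(cut_points, labels):
        if value <= upper:
            return label
    return labels[-1]  # value is above every cut point

def transform(record):
    """Turn numeric columns with many unique values into factors and recode the target."""
    out = dict(record)
    # Cut points below are assumptions chosen only for the example.
    out["age"] = to_bin(record["age"], [25, 45], ["young", "adult", "senior"])
    out["credit_amount"] = to_bin(record["credit_amount"], [2000, 5000], ["low", "medium", "high"])
    # Assumed target coding: 1 = good payer, anything else = bad payer.
    out["credit_rating"] = "good" if record["credit_rating"] == 1 else "bad"
    return out

records = [transform(r) for r in label_columns(raw_rows, COLUMN_NAMES)]
for record in records:
    print(record)
```

In practice the cut points would be chosen from the distribution of each variable instead of being fixed by hand.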


Balancing the data target

We call this process quantizing: a statistical algorithm balances the target variable so that its classes have similar sample sizes.

As Figure1 below shows, before balancing, the target classes had very different sizes; after balancing, this difference is minimized, as shown in Figure2.

Without this balance, the forecasting algorithm learns more about the target class that has more samples than about the other, which biases its predictions and damages the model.
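To make the idea of balancing concrete, here is a minimal sketch that uses simple random oversampling: rows of the smaller class are duplicated at random until both classes have the same size. This is a swapped-in technique chosen for brevity and is not necessarily the algorithm used in the Azure experiment; the toy data and the 70/30 split are illustrative.

```python
import random
from collections import Counter

def oversample(rows, target_key, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Add random duplicates to fill the gap to the largest class.
        balanced.extend(rng.choice(group) for _ in range(largest - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy data: 70 "good" and 30 "bad" customers (illustrative proportions).
rows = [{"id": i, "risk": "good" if i < 70 else "bad"} for i in range(100)]
balanced = oversample(rows, "risk")
counts = Counter(row["risk"] for row in balanced)
print(counts)
```

Balancing of this kind is normally applied only to the training data, so that the evaluation still reflects the real class proportions.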


Figure1.


Figure2.


2. Descriptive Analysis

As the intention here is only to give a general overview of the process, we will not go through every descriptive graph, only an example.

It is worth remembering that during this step we can evaluate the variables and even exclude some of them based on simple visual evidence of correlation; that is, in some cases descriptive analysis is enough to identify correlations between variables.
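A visual impression of correlation can be backed by a number. The sketch below computes the Pearson correlation coefficient between two numeric variables; the variable names and values are made up for illustration and do not come from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: longer credits tend to have larger amounts.
duration_months = [6, 12, 24, 36, 48]
credit_amount = [1000, 2100, 3900, 6200, 7900]

r = pearson(duration_months, credit_amount)
print(round(r, 3))
```

A coefficient close to 1 or -1 suggests that the two variables carry much of the same information, which is one reason to consider dropping one of them.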



3. Feature Selection

Although Azure has modules for this purpose, such as “Filter Based Feature Selection”, here we use the Random Forest algorithm in the R language to return the most important variables.

As we can see below, the Random Forest algorithm selected the most important variables to be used in the predictive model.
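Random Forest importance is commonly derived from how much each variable reduces class impurity across many trees. The sketch below is a simplified stand-in, not a Random Forest and not the R script used here: it measures the Gini impurity reduction of a single split per variable and ranks the variables by it. The feature names and rows are hypothetical.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_gain(rows, feature, target):
    """Impurity removed by splitting the rows on every category of one feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    parent = gini([row[target] for row in rows])
    # Weighted average impurity of the groups created by the split.
    children = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return parent - children

# Hypothetical toy data.
rows = [
    {"checking": "none", "housing": "own", "risk": "good"},
    {"checking": "none", "housing": "rent", "risk": "good"},
    {"checking": "none", "housing": "own", "risk": "good"},
    {"checking": "negative", "housing": "rent", "risk": "bad"},
    {"checking": "negative", "housing": "own", "risk": "bad"},
    {"checking": "negative", "housing": "rent", "risk": "good"},
]

ranking = sorted(["checking", "housing"],
                 key=lambda f: gini_gain(rows, f, "risk"),
                 reverse=True)
print(ranking)
```

A real Random Forest averages this kind of gain over many trees built on random subsets of rows and variables, which makes the ranking far more stable than a single split.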


4. Generating a Predictive Model

For this step we use three predictive algorithms to train the model in Azure:


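• Class Bayes Point Mach (Azure's "Two-Class Bayes Point Machine")

• Class Neural Network (Azure's "Two-Class Neural Network")

• SVM (Azure's "Two-Class Support Vector Machine")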


Applying these three algorithms gives us more options to choose from, so we can verify which model has the best predictive performance.



5. Optimizing the Predictive Model

In this step we consider every option that might improve the models used in our development process. The options we can consider are:

  • The business team can determine which variables are important

  • Select different variables

  • Insert external data for better standardization of historical internal data

  • Include new variables

  • Use different algorithms

  • Quantization

  • Optimize the parameters of the algorithms

  • Trade-off between false positives and false negatives
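The last item, the trade-off between false positives and false negatives, can be illustrated by choosing the decision threshold that minimizes total cost. The sketch below rests on assumptions: the scores are hypothetical probabilities of default, and the cost values (an approved bad payer costing five times more than a rejected good payer) are placeholders that a business team would replace with real figures.

```python
# Hypothetical costs, for illustration only.
COST_FALSE_NEGATIVE = 5  # credit approved, customer turns out to be a bad payer
COST_FALSE_POSITIVE = 1  # credit refused, customer would have been a good payer

def total_cost(scores, is_bad, threshold):
    """Cost of refusing every customer whose default score is >= threshold."""
    false_pos = sum(1 for s, bad in zip(scores, is_bad) if s >= threshold and not bad)
    false_neg = sum(1 for s, bad in zip(scores, is_bad) if s < threshold and bad)
    return false_pos * COST_FALSE_POSITIVE + false_neg * COST_FALSE_NEGATIVE

def best_threshold(scores, is_bad, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(scores, is_bad, t))

# Hypothetical model scores (probability of default) and true outcomes.
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
is_bad = [False, False, True, False, False, True, True, True]
candidates = [0.3, 0.5, 0.7]

chosen = best_threshold(scores, is_bad, candidates)
for t in candidates:
    print(t, total_cost(scores, is_bad, t))
print("chosen threshold:", chosen)
```

With these assumed costs the lowest threshold wins, because missing a bad payer is penalized much more heavily than refusing a good one; different costs would move the threshold.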


6. Analyzing the Predictive Models

In this step we evaluate the accuracy results, comparing the three algorithms applied in Azure with the algorithm written in R, in order to identify which model will be chosen for publication.
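For reference, accuracy and the related metrics all come from the same four counts of a confusion matrix. The sketch below shows the arithmetic on a handful of made-up predictions; the labels and the choice of "bad" as the positive class are assumptions for the example.

```python
def confusion_counts(actual, predicted, positive="bad"):
    """Count true/false positives and negatives, treating `positive` as the class of interest."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def accuracy(c):
    """Share of all predictions that were correct."""
    return (c["tp"] + c["tn"]) / sum(c.values())

def precision(c):
    """Share of customers flagged as positive that really were positive."""
    flagged = c["tp"] + c["fp"]
    return c["tp"] / flagged if flagged else 0.0

def recall(c):
    """Share of truly positive customers that the model flagged."""
    actual_positive = c["tp"] + c["fn"]
    return c["tp"] / actual_positive if actual_positive else 0.0

# Made-up outcomes and predictions.
actual = ["bad", "bad", "good", "good", "good", "bad", "good", "good"]
predicted = ["bad", "good", "good", "good", "bad", "bad", "good", "good"]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts), precision(counts), recall(counts))
```

Accuracy alone can hide an unbalanced error profile, so it is worth reading it together with precision and recall for the "bad" class.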


The figure below compares two models: the "Score dataset" legend (blue) represents the "Class Bayes Point Mach" algorithm and the red one represents the Class Neural Network.



Below we can see the final result of the Class Neural Network model (red graph).


Below we can see the final result of the Class Bayes Point Mach model (blue graph).




As we can see, the better model is Class Bayes Point Mach (blue graph).

Now we compare the Class Bayes Point Mach model (blue graph) with the remaining model, SVM, to determine which one is better.

Below we can see the SVM results, which show that the Class Bayes Point Mach model still has the best accuracy, at 68.3%.


Now we compare the Azure results with the Random Forest algorithm developed in R. As we can see, Random Forest reaches an accuracy of 68.62%.



Final Conclusion

So, when we compare the accuracy of Random Forest in R with that of Class Bayes Point Mach in Azure, we conclude that the result is a technical tie, with both at around 68%.
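A rough way to check that such a small gap really is a tie is to compare it with the sampling error of an accuracy estimate. The sketch below uses the two accuracies reported above; the test set size of 300 is purely hypothetical, since the article does not state it, and the comparison against one standard error is a rule of thumb and not a formal test.

```python
from math import sqrt

def accuracy_standard_error(accuracy, n_test):
    """Standard error of an accuracy measured on n_test independent observations."""
    return sqrt(accuracy * (1 - accuracy) / n_test)

ACC_BAYES_POINT = 0.683     # reported for Class Bayes Point Mach in Azure
ACC_RANDOM_FOREST = 0.6862  # reported for Random Forest in R
N_TEST = 300                # assumption: the real test set size is not given

difference = abs(ACC_RANDOM_FOREST - ACC_BAYES_POINT)
standard_error = accuracy_standard_error(ACC_BAYES_POINT, N_TEST)
is_tie = difference < standard_error

print(f"difference: {difference:.4f}")
print(f"standard error: {standard_error:.4f}")
print("technical tie:", is_tie)
```

Under this assumption the gap between the two models is several times smaller than the sampling error, which supports treating the result as a tie.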

As discussed in step 5, we can improve the accuracy by testing different scenarios.

After all the steps above, with the model selected, we only need to import new data in the same format into R or Azure and publish the model, which then returns the prediction used to decide whether the credit is approved.

In addition, we can display any credit analysis information in real time with Power BI visualizations to support the credit department's decisions, as we have demonstrated in other articles.

Together, these innovative technology tools can help your business take its decision-making to the next level.


Benefits of Credit Analysis using Machine Learning

  • Minimize the Credit Risk

  • Evaluate the cost of loss of sale versus cost of unpaid credit

  • Reduced human resources costs

  • Increased sales through credit approval for customers previously denied

  • Increased sales by adjusting the ideal credit limit

  • Improved cash flow

  • Opportunity cost reduction, etc.



