Introduction
In the business world, credit policy is one of the main strategic sales policies. Knowing how to analyze the risk of both approved and denied credit using statistical science is fundamental to the balance of the business.
Failing to use data technology to support the credit policy leads to lost customers when credit is denied to good customers and, on the other hand, to non-payment and delayed receivables when credit is approved for defaulting customers, creating cash flow problems.
A poorly calibrated credit analysis forces the company to hold more working capital to cover cash flow and also raises financial expenses with prepayments and replacement of receivables.
Therefore, applying data science to credit analysis is essential to increase market share and sales, reduce financial expenses, and reduce the working capital needed to balance cash flow.
Summary
This experiment demonstrates the process of building a classification model to predict the risk of granting credit to a bank's customers. We will use a dataset to build and train our model.
Dataset Information
The “German Credit Data” dataset will be used to build and train the model in this experiment. This dataset is based on real data collected by a researcher at the University of Hamburg, Germany.
The dataset contains 1,000 observations and 20 variables representing customer data, such as current account status, credit history, current credit amount, employment, residence, and age.
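The article builds the model in R and Azure; as an illustrative sketch only, the dictionary of 20 features plus the target can be laid out as follows. These column names are our own shorthand for the attributes described in the repository dictionary, not the official codes.

```python
# Shorthand labels for the 20 features plus the target of the
# German Credit dataset (names are our own, not the official codes).
columns = [
    "checking_status", "duration", "credit_history", "purpose",
    "credit_amount", "savings_status", "employment", "installment_rate",
    "personal_status", "other_parties", "residence_since", "property",
    "age", "other_payment_plans", "housing", "existing_credits",
    "job", "num_dependents", "telephone", "foreign_worker",
    "credit_risk",  # target variable
]
print(len(columns))  # 20 features + 1 target
```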
In our repository below, you can find all details and scripts for all projects.
Dataset: https://github.com/lexxconsulting
Objective
The objective is to predict the risk that each customer poses to the bank at the time a credit line is granted. The predictive model must be quite accurate, as granting credit to a customer with poor payment potential can bring a huge loss to the bank.
Lifecycle
1. Data Transformation
2. Exploratory Analysis
3. Feature Selection
4. Generating a Predictive Model
5. Optimizing the Predictive Model
6. Analyzing the Predictive Models
The figure below shows the whole process using the Microsoft tool Azure Machine Learning Studio.
1. Data Transformation
In this step we will perform the following routines, based on the data dictionary from the original repository:
• Label the columns
• Transform numerical variables into factors
• Transform numeric variables with a considerable number of unique values into factors
• Quantize the target variable
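The article performs these transformations in Azure; purely as a hedged sketch, the same routines look like this in Python with pandas on a toy frame (values and labels are illustrative, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the raw dataset (values are illustrative).
df = pd.DataFrame({
    "checking_status": ["A11", "A12", "A14"],
    "duration": [6, 48, 12],
    "credit_risk": [1, 2, 1],   # numeric coding of the target
})

# Coded variables that are really categories become factors.
df["checking_status"] = df["checking_status"].astype("category")

# Recode the target variable into readable class labels.
df["credit_risk"] = (
    df["credit_risk"].map({1: "good", 2: "bad"}).astype("category")
)

print(df.dtypes)
```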
The figure below shows all the named columns and the transformed data types.
Balancing the target variable
In this process, often called class balancing, we use a statistical algorithm to balance the target variable so that the classes have the same number of samples.
As we can see in Figure 1 below, before balancing the target classes had very different sizes; after balancing, this difference is minimized, as shown in Figure 2.
Without balancing, the learning algorithm learns more about the class with more samples than the other, which biases the predictions and harms the model.
Figure1.
Figure2.
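The article balances the classes inside Azure; as a minimal illustrative sketch (not the article's actual method), the same idea can be shown in Python by oversampling the minority class with scikit-learn's `resample` on toy data:

```python
import pandas as pd
from sklearn.utils import resample

# Imbalanced toy target: 6 "good" customers vs 2 "bad".
df = pd.DataFrame({
    "amount": [100, 200, 300, 400, 500, 600, 700, 800],
    "risk":   ["good"] * 6 + ["bad"] * 2,
})

majority = df[df["risk"] == "good"]
minority = df[df["risk"] == "bad"]

# Oversample the minority class up to the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["risk"].value_counts())
```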
2. Exploratory Analysis
Since the intention here is just to give a general view of the process, we will not cover all the descriptive graphics, only an example.
It is worth remembering that during this process we can evaluate the variables and even exclude some based on a simple visual assessment; that is, in some cases it is also possible to identify correlations between variables through descriptive analysis.
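One way to make that correlation check concrete, sketched here in Python on illustrative numbers rather than the real dataset: compute the pairwise correlation matrix and flag highly correlated pairs as candidates for dropping one member.

```python
import pandas as pd

# Toy numeric frame standing in for the dataset (values are illustrative).
df = pd.DataFrame({
    "duration": [6, 12, 24, 36, 48],
    "amount": [1000, 2500, 4200, 5800, 9000],
})

# Pairs with very high correlation are candidates for removing one variable.
corr = df.corr()
print(corr.round(2))
```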
3. Feature Selection
Although Azure has modules for this purpose, such as “Group Data into Bins”, here we will use the Random Forest algorithm in the R language to return the most important variables.
As we can see below, the Random Forest algorithm selected the most important variables to be used in the predictive model.
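The article runs this step in R; as an equivalent hedged sketch in Python (synthetic data, not the article's results), a Random Forest ranks features by importance, and an informative feature clearly outscores pure noise:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-in: "duration" drives the label, "noise" does not.
duration = rng.uniform(6, 48, n)
noise = rng.normal(size=n)
X = np.column_stack([duration, noise])
y = (duration > 24).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["duration", "noise"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```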
4. Generating a Predictive Model
For this process we use three predictive algorithms to train the model in Azure:
• Two-Class Bayes Point Machine
• Two-Class Neural Network
• Two-Class Support Vector Machine (SVM)
With these three models, we have more options to choose from and can verify which model has the better predictive performance.
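The training itself happens in Azure modules; purely as an illustrative Python sketch of the same compare-several-classifiers workflow (synthetic data, and GaussianNB standing in for the Bayes Point Machine, which has no direct scikit-learn equivalent):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# scikit-learn stand-ins for the three Azure modules.
models = {
    "bayes (placeholder)": GaussianNB(),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
    "svm": SVC(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```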
5. Optimizing the Predictive Model
In this step we should consider all the possibilities that allow us to optimize the models used in our development process. The steps we can consider are:
• The business team can determine which variables are important
• Select different variables
• Insert external data to better standardize the historical internal data
• Include new variables
• Use different algorithms
• Quantization
• Optimize the parameters of the algorithms
• Trade off false positives against false negatives
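Two of the items above can be sketched concretely in Python (illustrative data and parameter grid, not the article's configuration): parameter optimization via grid search, and the false-positive/false-negative trade-off via the decision threshold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Parameter optimization: search over the regularization strength C.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])

# Trade-off: instead of the default 0.5 cut-off, raise the threshold so
# credit is approved only when the model is more confident the customer
# is good -- fewer false approvals at the cost of more false denials.
proba_good = grid.predict_proba(X)[:, 1]
approved_default = (proba_good >= 0.5).sum()
approved_strict = (proba_good >= 0.7).sum()
print(approved_default, approved_strict)
```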
6. Analyzing the Predictive Models
In this step we evaluate all the accuracy results, comparing the results generated by the three algorithms applied in Azure and by the algorithm written in R, in order to identify which one will be chosen for publication.
The figure below shows the results of two models, where the "Score dataset" legend in blue represents the Two-Class Bayes Point Machine algorithm and the red represents the Two-Class Neural Network.
Below we can see the Two-Class Neural Network model's final result (red graph).
Below we can see the Two-Class Bayes Point Machine model's final result (blue graph).
As we can see, the best model is the Two-Class Bayes Point Machine (blue graph).
Now we will compare the Two-Class Bayes Point Machine model (blue graph) with the SVM model to define which one is better.
Below we can see the SVM results; the Two-Class Bayes Point Machine model still has the best accuracy, 68.3%.
Now we compare the Azure results with the Random Forest algorithm developed in R; as we can see, the Random Forest accuracy is 68.62%.
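To make this kind of comparison concrete, here is a minimal Python sketch using hypothetical hold-out labels and predictions (not the article's actual outputs): accuracy ranks the candidates, and the confusion matrix exposes the false-positive/false-negative split behind the winner's score.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical hold-out labels and predictions from two candidate models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
pred_a = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]  # candidate model A
pred_b = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # candidate model B

for name, pred in [("model A", pred_a), ("model B", pred_b)]:
    print(name, accuracy_score(y_true, pred))

# Rows: actual class, columns: predicted class.
print(confusion_matrix(y_true, pred_a))
```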
Final Conclusion
So, comparing the accuracy results of Random Forest in R and the Two-Class Bayes Point Machine in Azure, we conclude that we have a technical tie, both around 68%.
As discussed in topic 5, we can improve the accuracy by creating different scenarios.
After all the steps above, with the model selected, we only need to import a dataset in the same format into R or Azure and publish it; the model will then return the prediction used to decide whether the credit will be approved or not.
In addition, we can display credit analysis information in real time for the credit department to drive decisions, using Power BI visualization, as we have demonstrated in other articles.
Together, these innovative technology tools can support your business in taking decision-making to the next level.
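In Python terms, that publish-then-score step can be sketched as serializing a trained model and later loading it to score new data arriving in the same format (a generic illustration with synthetic data and logistic regression, not the article's deployed Azure/R pipeline):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# "Publish": serialize the trained model; later, load it and score
# incoming data that has the same format as the training set.
blob = pickle.dumps(model)
loaded = pickle.loads(blob)
decision = loaded.predict(X[:1])[0]  # class label drives approve/deny
print(decision)
```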
Benefits of Credit Analysis using Machine Learning
• Minimize credit risk
• Evaluate the cost of a lost sale versus the cost of unpaid credit
• Cost reductions in human resources
• Increased sales through credit approval for customers previously denied
• Increased sales by adjusting the ideal credit limit
• Improved cash flow
• Opportunity cost reduction, etc.