Credit Scoring Walk-through - Part One

A credit scoring model is a statistical tool widely used by lenders to assess the creditworthiness of their potential and existing customers. The basic idea behind this model is that various demographic attributes and the past repayment behaviour of an individual can be used to predict his or her probability of default.

This blog post will mostly cover the first phase of every model development project: data exploration and preparation.

Data Set: Home Equity Loans

For demonstration purposes, we will be using the data set that contains characteristics and delinquency information for 5,960 home equity loans (source: B. Baesens, D. Roesch, H. Scheule, Credit Risk Analytics: Measurement Techniques, Applications and Examples in SAS, Wiley, 2016). The data set variables are:

  • BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
  • LOAN: Amount of the loan request
  • MORTDUE: Amount due on existing mortgage
  • VALUE: Value of current property
  • REASON: DebtCon = debt consolidation; HomeImp = home improvement
  • JOB: Occupational categories
  • YOJ: Years at present job
  • DEROG: Number of major derogatory reports
  • DELINQ: Number of delinquent credit lines
  • CLAGE: Age of oldest credit line in months
  • NINQ: Number of recent credit inquiries
  • CLNO: Number of credit lines
  • DEBTINC: Debt-to-income ratio

The binary variable BAD will be the target variable in our credit scoring model, while other variables will be used as predictors. The first six observations (i.e. loans) of the data set are given below:

Home Equity Loans data set (first six observations)

BAD   LOAN   MORTDUE  VALUE    REASON   JOB     YOJ   DEROG  DELINQ  CLAGE  NINQ  CLNO  DEBTINC  BADn
bad   1,100  25,860   39,025   HomeImp  Other   10.5  0      0       94.4   1     9     NA       1
bad   1,300  70,053   68,400   HomeImp  Other   7     0      2       121.8  0     14    NA       1
bad   1,500  13,500   16,700   HomeImp  Other   4     0      0       149.5  1     10    NA       1
bad   1,500  NA       NA       NA       NA      NA    NA     NA      NA     NA    NA    NA       1
good  1,700  97,800   112,000  HomeImp  Office  3     0      0       93.3   0     14    NA       0
bad   1,700  30,548   40,320   HomeImp  Other   9     0      0       101.5  1     8     37.1     1
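
A minimal sketch of how the data set might be loaded and previewed in R (the file name hmeq.csv, the explicit factor conversions and the factor labels are assumptions for illustration, not code taken from the original post):

```r
# load the data (file name is an assumption)
hmeq <- read.csv("hmeq.csv")

# keep the numeric target as BADn and recode BAD as a labelled factor,
# mirroring the summary output shown further below
hmeq$BADn <- hmeq$BAD
hmeq$BAD  <- factor(hmeq$BAD, levels = c(0, 1), labels = c("good", "bad"))

# character predictors as factors (so the empty strings show up as a level)
hmeq$REASON <- factor(hmeq$REASON)
hmeq$JOB    <- factor(hmeq$JOB)

head(hmeq)
```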

Next, we will take a look at some basic summary statistics for our variables:
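Assuming the data frame is named hmeq, as in the sketch above, the statistics below boil down to a single call:

```r
summary(hmeq)
```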

##    BAD            LOAN          MORTDUE           VALUE       
##  good:4771   Min.   : 1100   Min.   :  2063   Min.   :  8000  
##  bad :1189   1st Qu.:11100   1st Qu.: 46276   1st Qu.: 66076  
##              Median :16300   Median : 65019   Median : 89236  
##              Mean   :18608   Mean   : 73761   Mean   :101776  
##              3rd Qu.:23300   3rd Qu.: 91488   3rd Qu.:119824  
##              Max.   :89900   Max.   :399550   Max.   :855909  
##                              NA's   :518      NA's   :112     
##      REASON          JOB            YOJ             DEROG        
##         : 252          : 279   Min.   : 0.000   Min.   : 0.0000  
##  DebtCon:3928   Mgr    : 767   1st Qu.: 3.000   1st Qu.: 0.0000  
##  HomeImp:1780   Office : 948   Median : 7.000   Median : 0.0000  
##                 Other  :2388   Mean   : 8.922   Mean   : 0.2546  
##                 ProfExe:1276   3rd Qu.:13.000   3rd Qu.: 0.0000  
##                 Sales  : 109   Max.   :41.000   Max.   :10.0000  
##                 Self   : 193   NA's   :515      NA's   :708      
##      DELINQ            CLAGE             NINQ             CLNO     
##  Min.   : 0.0000   Min.   :   0.0   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.: 0.0000   1st Qu.: 115.1   1st Qu.: 0.000   1st Qu.:15.0  
##  Median : 0.0000   Median : 173.5   Median : 1.000   Median :20.0  
##  Mean   : 0.4494   Mean   : 179.8   Mean   : 1.186   Mean   :21.3  
##  3rd Qu.: 0.0000   3rd Qu.: 231.6   3rd Qu.: 2.000   3rd Qu.:26.0  
##  Max.   :15.0000   Max.   :1168.2   Max.   :17.000   Max.   :71.0  
##  NA's   :580       NA's   :308      NA's   :510      NA's   :222   
##     DEBTINC              BADn       
##  Min.   :  0.5245   Min.   :0.0000  
##  1st Qu.: 29.1400   1st Qu.:0.0000  
##  Median : 34.8183   Median :0.0000  
##  Mean   : 33.7799   Mean   :0.1995  
##  3rd Qu.: 39.0031   3rd Qu.:0.0000  
##  Max.   :203.3121   Max.   :1.0000  
##  NA's   :1267

As we can see, there is a considerable number of missing values (NA’s). So, what should we do? The easiest approach, obviously, is to delete all the observations that contain them (e.g. na.omit() in R would do the trick) and then act like they never existed. But if we actually decided to do this, we would lose almost two-fifths of our initial data set! Another option is to impute (i.e. replace) the missing values with some logical values. For example, we could use the mean or median to replace NA values for the continuous variables, or the mode for the categorical ones. Or we could apply more complex imputation methods, such as multivariate imputation by chained equations (in R the mice package implements this method). Unfortunately, in most cases imputation methods, even the most sophisticated ones (that sip Merlot while reading Walt Whitman poems next to the fireplace), add little, if anything, to the model’s performance. Lastly, the third approach is to keep the missing values in the data set, as they can actually be quite informative. For instance, if someone’s income is missing, the explanation could be that he or she is currently unemployed. If missing values are related to the target variable, then the best course of action is to keep them in the model development data set (we will see in Part Two how this can be stylishly achieved with weight of evidence coding).
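To make the three options concrete, here is a hedged sketch in R (the data frame name hmeq is assumed, and the mice settings are illustrative defaults rather than the post's actual choices):

```r
library(mice)

# Option 1: drop all incomplete observations (here this loses ~40% of the data)
hmeq_complete <- na.omit(hmeq)

# Option 2a: simple imputation, e.g. median for a continuous variable
hmeq_median <- hmeq
hmeq_median$DEBTINC[is.na(hmeq_median$DEBTINC)] <-
  median(hmeq$DEBTINC, na.rm = TRUE)

# Option 2b: multivariate imputation by chained equations
imp <- mice(hmeq, m = 5, seed = 123, printFlag = FALSE)
hmeq_mice <- complete(imp)

# Option 3: keep the NAs and deal with them later (e.g. via WoE coding in Part Two)
```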

But before we deal with our precious missing data, we should direct our full attention to the non-missing data. After all, models are all about non-missing data (no offense, dear NA’s). Thus, let us now draw some nice-looking (gg)plots.

Data Exploration

The first variable we analyse is BAD, which will be the target variable in our credit scoring model.

[Figure: bar plot of the target variable BAD]

This simple bar plot shows that the proportion of bad applicants (those who are seriously delinquent or have already defaulted) in the data set is about 20%. The total number of ‘bads’ (roughly 1,200) is more or less acceptable for model development.

“Typically about a minimum of 2,000 each of goods, bads, and rejects are sufficient for scorecard development.” (N. Siddiqi, Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards, Wiley, 2017)

Our next step is to visually inspect the relationship between the target variable and the other variables in the data set. For the categorical variables and for continuous variables with a relatively small number of possible values we will use bar plots. For the “regular” continuous variables, on the other hand, we will produce density curves.
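A rough sketch of the two plot types with ggplot2 (the variable choices here are illustrative; the post's exact plotting code is not shown):

```r
library(ggplot2)

# bar plot: proportion of goods/bads within each level of a categorical predictor
ggplot(hmeq, aes(x = JOB, fill = BAD)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")

# density curves of a continuous predictor by target class
ggplot(hmeq, aes(x = DEBTINC, colour = BAD)) +
  geom_density(na.rm = TRUE)
```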

It is obvious that the bad rate ( bads/(goods+bads) ) is roughly the same regardless of the reason for the loan. In fact, the calculated Cramér’s V of just 0.04 (CramerV() from DescTools was used) confirms this conclusion, with 0.10 commonly used in industry as a cutoff value. Furthermore, it seems that the missing values, in this case, carry no valuable information.
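The Cramér’s V mentioned above can be computed along these lines (a sketch; the 0.04 figure comes from the post, not from running this snippet):

```r
library(DescTools)

# association between loan reason and the target
CramerV(hmeq$REASON, hmeq$BAD)
```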

Clearly, the occupation of the applicant matters (Cramér’s V = 0.13). For instance, the bad rate for a salesperson is almost three times as high as for an office worker. Also, in this data set, the obligors whose occupation information is not available surprisingly exhibit the lowest risk.

As expected, the higher the number of major derogatory reports, the greater the risk. This variable is continuous, but it could easily, and probably without much loss of information, be converted to an ordinal factor variable (e.g. zero, one, and two or more reports).
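One possible way to bin DEROG as suggested (the cut points are the ones named in the text; the implementation itself is an assumption):

```r
# zero, one, and two-or-more derogatory reports; NAs stay NA for now
hmeq$DEROG_bin <- cut(hmeq$DEROG,
                      breaks = c(-Inf, 0, 1, Inf),
                      labels = c("0", "1", "2+"))
```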

The variables DELINQ and NINQ exhibit similar behaviour to the previously mentioned DEROG variable, so they too can be transformed into discrete variables.

It seems that applicants who took out smaller home equity loans are somewhat riskier than those with larger ones. One possible explanation could be that the application process for larger loans is more rigorous.

At first glance, the amount due on the existing mortgage and the value of the current property don’t seem to be especially related to the likelihood that the applicant will repay the debt.

On the other hand, people who don’t change their jobs so often seem to be a safer bet for the lender.

Further, it appears that the number of credit lines and the age of the oldest credit line are to some extent related to the individual’s credit risk.

Finally, the debt-to-income ratio is definitely positively associated with the probability of default on a home equity loan.

In addition, we can run some statistical tests to evaluate the predictive power of the discussed continuous variables. Individual t-test results are shown in the following table:

variable   t.stat   p.value
LOAN         5.72   0
MORTDUE      3.38   0.001
VALUE        1.86   0.06
YOJ          4.66   0.000003
DEROG      -12.76   0
DELINQ     -17.01   0
CLAGE       12.96   0
NINQ       -10.53   0
CLNO         0.28   0.78
DEBTINC     -6.90   0
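
A hedged sketch of how such a table could be assembled (Welch two-sample t-tests of each continuous variable by the target; the exact code behind the table above is not shown in the post and may differ in details):

```r
num_vars <- c("LOAN", "MORTDUE", "VALUE", "YOJ", "DEROG",
              "DELINQ", "CLAGE", "NINQ", "CLNO", "DEBTINC")

# one row per variable: t statistic and p-value of goods vs. bads
t_table <- do.call(rbind, lapply(num_vars, function(v) {
  tt <- t.test(hmeq[[v]] ~ hmeq$BAD)
  data.frame(variable = v,
             t.stat   = round(tt$statistic, 2),
             p.value  = round(tt$p.value, 6))
}))
t_table
```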

Well, that’s all great, but what about those missing values? They are not present in any of the displayed visual and statistical analyses. The t-test just doesn’t give a damn about them, but we should. Hence, for each continuous variable we calculate the bad rate (in %) for missing data and compare it with the bad rate for available data.

variable   bad.rate.non.missing   bad.rate.missing
LOAN              19.95                 NA
MORTDUE           19.90              20.46
VALUE             18.54              93.75
YOJ               20.64              12.62
DEROG             20.98              12.29
DELINQ            20.76              12.41
CLAGE             19.66              25.32
NINQ              20.44              14.71
CLNO              19.80              23.87
DEBTINC            8.59              62.04

The most notable differences in bad rates are present for the value of the current property (VALUE) and the debt-to-income ratio (DEBTINC). The missing-data bad rate for VALUE is a striking 94%, although it is obtained from only 112 loans, or less than 2% of the data set. For DEBTINC, on the other hand, this bad rate is 62%, but it is attributed to more than one-fifth of all loans. All things considered, the missing values should certainly be kept in the data set.
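These bad rates can be reproduced roughly as follows (a sketch reusing num_vars from the t-test snippet and the numeric target BADn; the percentages above come from the post itself, not from running this code):

```r
# bad rate (in %) among observations where the variable is missing vs. available;
# result is NaN when a variable has no missing values at all (e.g. LOAN)
miss_table <- do.call(rbind, lapply(num_vars, function(v) {
  miss <- is.na(hmeq[[v]])
  data.frame(variable             = v,
             bad.rate.non.missing = round(100 * mean(hmeq$BADn[!miss]), 2),
             bad.rate.missing     = round(100 * mean(hmeq$BADn[miss]), 2))
}))
miss_table
```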

Training and Test Set

Before proceeding to model development, it is necessary to divide our data set into a training and a test set. Model(s) will then be developed using exclusively the training sample, while the test sample will be used for model validation. In our case, the training sample will contain three-quarters of the data set, and the split will be stratified by the target variable, so the bad rate will be the same in both samples (in R this can be accomplished with the createDataPartition() function from the caret package). The final partition is shown in the bar plot.
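A sketch of the stratified split with caret (the seed and object names are assumptions):

```r
library(caret)

set.seed(42)
# stratified on BAD, 75% of observations go to the training sample
train_idx <- createDataPartition(hmeq$BAD, p = 0.75, list = FALSE)

train <- hmeq[train_idx, ]
test  <- hmeq[-train_idx, ]

# the bad rate should be (almost) identical in both samples
prop.table(table(train$BAD))
prop.table(table(test$BAD))
```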

End of Part One

So far we have thoroughly explored the data set, gained insights into its characteristics and possible caveats, visually and statistically examined the associations of our soon-to-be predictors with the target variable, and created training and test samples, which will be used in the model development and validation process. In Part Two we are going to introduce weight of evidence coding, apply it to the independent variables and build our first credit scoring model.
