Credit Risk Modeling in Python

Theoretical Foundation

The likelihood that a borrower will not repay their loan to the lender ➔ the lender will not receive the owed principal, nor will they be paid the interest, and will therefore suffer a substantial loss ➔ credit risk.
In addition, the lender will likely have to sustain substantial costs in an effort to recover the outstanding debt ➔ collection costs.
When a borrower is not able to make the required payments to repay their debt ➔ default.

Some ways lenders can protect themselves against credit losses:
• measure credit risk well,
• require collateral,
• increase the price of lending the funds (the interest rate).

Expected loss is the amount a lender might lose by lending to a borrower.

EL = PD x LGD x EAD

Where:
PD – Probability of default: the borrower's inability to repay their debt in full or on time. PD is the estimate of the likelihood that the borrower will default. For the PD model, I need an indicator or flag of whether the borrower defaulted or not.
LGD – Loss given default: the proportion of the total exposure that cannot be recovered by the lender once a default has occurred. LGD is the share of an asset that is lost if a borrower defaults. For the LGD model, I need to calculate how much of the loan was recovered after the borrower had defaulted. This information is contained in the recoveries column, so that will be our dependent variable.
EAD – Exposure at default: the total value that a lender is exposed to when a borrower defaults. EAD is the maximum that a bank may lose when a borrower defaults on a loan. For the EAD model, I must calculate the total exposure at the moment the borrower defaulted compared to the total exposure in the past. The relevant information is in the total recovered principal column. Of course, I can also use all the other variables I have.

For example:
A borrower wants to buy a house: $500,000.
A bank funds 80% of the purchase, i.e. Loan-to-Value (LTV) = 80%, so the loan amount = $500,000 x 80% = $400,000.
The borrower has already repaid 10% of the loan = $400,000 x 10% = $40,000.
The outstanding balance = $400,000 – $40,000 = $360,000.
If the borrower defaults ➔ EAD = $360,000.
Assume that there is empirical evidence that one in four homeowners have defaulted in previous years. So PD = 1/4 = 25%.
If the borrowers defaults, the bank can sell the house immediately for $342,000 ➔ the bank can recover $342,000 ➔ the remaining loss = $360,000 – $342,000 = $18,000 ➔ LGD = $18,000/$360,000 = 5%.
➔ EL = PD x LGD x EAD = 25% x 5% x $360,000 = $4,500.
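
A quick check of the arithmetic in Python (the variable names are mine):

# Worked example from above
pd_default = 0.25                 # probability of default (1 in 4 homeowners)
lgd = 18_000 / 360_000            # loss given default = 5%
ead = 360_000                     # exposure at default ($)
expected_loss = pd_default * lgd * ead
print(expected_loss)              # 4500.0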

Capital adequacy, regulations, and the Basel II accord

To avoid developments that could paralyze the economy, regulators have come up with a set of rules that have two main goals:
(1) Regulate bank operations and hence reduce risky behavior,
(2) Guarantee to the public that the banking system is in good health.

Capital requirement: Require banks to hold enough capital that would allow them to absorb the losses from defaults.
Capital adequacy ratio – CAR = Capital / Risk-Weighted Assets >= 8%

The primary objective of the Basel II Accord is to ensure that the capital allocation banks carry out is risk sensitive. The greater the risk a bank is exposed to, the greater the amount of capital the bank needs to hold to safeguard its solvency and overall economic stability.
The three pillars of the Basel II Accord are:
(1) Minimum Capital Requirements,
(2) Supervisory Review,
(3) Market Discipline.

The three main types of risk banks face are:
(1) Credit Risk,
(2) Operational Risk,
(3) Market Risk

The Basel II Accord prescribes that regulators should allow banks to choose from three different approaches for calculating or modeling credit risk.

That is, they can choose from one of three different approaches for calculating or modeling each of the three components of the expected loss. These three approaches are called Standardized Approach (SA), Foundation Internal Ratings Based approach (F-IRB), Advanced Internal Ratings Based approach (A-IRB).

Under the standardized approach, the capital that has to be held is prescribed as a percentage of the total exposure. Under the internal ratings based approaches, the expected loss, which is the product of PD, LGD, and EAD, is calculated, and the bank's capital should be sufficient to cover the expected losses.
Under the standardized approach, banks are required to use data from external credit agencies to assess credit risk.
The other two approaches are called internal ratings based because the standardized approach relies entirely on external ratings provided by credit agencies such as FICO, which provides credit scores for individuals and households. Other similar agencies are S&P, Moody's and Fitch.

Presenting information about credit risk: The credit risk of every entity is represented by its Credit Rating. The lower the entity's credit rating, the higher its credit risk, and hence the lower its creditworthiness. For individuals, these credit ratings come in the form of credit scores. The most popular credit score is perhaps the FICO score. This is the credit score calculated and provided by a company called FICO. You may have heard that everyone has a FICO score, at least in the US. The score ranges between 300 and 850.

Please visit this link to calculate the FICO score of a borrower.

For firms, financial institutions and countries, credit ratings are shown as letters. A good example is the S&P credit rating scale. The entities with the lowest credit risk and highest creditworthiness are rated AAA, and the ones with the highest credit risk and the lowest creditworthiness are rated D.

How each EL component is obtained under the three approaches:

                          SA                    F-IRB                  A-IRB
Probability of Default    Externally provided   Internally estimated   Internally estimated
Loss Given Default        Externally provided   Externally provided    Internally estimated
Exposure At Default       Externally provided   Externally provided    Internally estimated

IRB approaches:
• allow banks to establish their own credit ratings,
• allow more precise calculations of the capital that has to be held for each individual exposure,
• help allocate resources to cover losses.

Under the standardized approach, particular types of borrowers and products are treated differently in terms of capital requirement calculations. For instance, the exposure towards every company rated from AAA to AA- carries a risk weight of 20%, the exposure towards every company rated from A+ to A- a risk weight of 50%, and so on. On the other hand, retail exposures to individuals carry a risk weight of as much as 75%, and residential mortgages a risk weight of 35%. The capital that has to be held is then a percentage of these risk-weighted exposures.

As you can see, there are different approaches and rates for calculating capital requirements depending on whether the borrower is an individual, a company, another financial institution, or a country.

The calculations also differ by product types, also called facility types. Even when the borrowers are of the same type – for example, both retail loans and mortgages are given to individuals – retail exposures have a 75% risk weight, while mortgages have a 35% risk weight. Similarly, the internal ratings based (IRB) approaches may use different methods and statistical models for different types of borrowers and facility types.

Application model: Application models are used to estimate a borrower's credit rating at the moment of application. The estimated credit ratings, in turn, are the basis on which banks decide whether to grant a loan or not. A bank may also use the estimated credit ratings to decide how to price the loan, that is, what interest rate to charge for the respective loan. This is known as risk-based pricing. The riskier a loan is, the higher its price, i.e. the higher the interest rate charged to the customer.

Behavior model: Behavioral models, on the other hand, are used to calculate probability of default and respectively expected loss after a loan is granted. Banks may also use behavioral models to decide whether to grant an additional loan to an existing customer.


Data

Download dataset here.
The dataset contains all available data for more than 800,000 consumer loans issued from 2007 to 2015 by Lending Club: a large US peer-to-peer lending company. There are several different versions of this dataset. I have used a version available on kaggle.com. You can find it here:

Depending on the type of data, discrete or continuous, I will have to apply different pre-processing techniques. For our purposes, we'll distinguish between two types: discrete and continuous:
Discrete or categorical variables take only a certain finite number of values.
Continuous or numerical variables, on the other hand, can take any value in a given range, or in other words, they can take on an infinite number of possible values.

First, look at the columns and their descriptions:

and the data columns' information:

loan_data.info()
 #   Column                       Non-Null Count   Dtype
 0   Unnamed: 0                   466285 non-null  int64
 1   id                           466285 non-null  int64
 2   member_id                    466285 non-null  int64
 3   loan_amnt                    466285 non-null  int64
 4   funded_amnt                  466285 non-null  int64
 5   funded_amnt_inv              466285 non-null  float64
 6   term                         466285 non-null  object
 7   int_rate                     466285 non-null  float64
 8   installment                  466285 non-null  float64
 9   grade                        466285 non-null  object
 10  sub_grade                    466285 non-null  object
 11  emp_title                    438697 non-null  object
 12  emp_length                   445277 non-null  object
 13  home_ownership               466285 non-null  object
 14  annual_inc                   466281 non-null  float64
 15  verification_status          466285 non-null  object
 16  issue_d                      466285 non-null  object
 17  loan_status                  466285 non-null  object
 18  pymnt_plan                   466285 non-null  object
 19  url                          466285 non-null  object
 20  desc                         125981 non-null  object
 21  purpose                      466285 non-null  object
 22  title                        466264 non-null  object
 23  zip_code                     466285 non-null  object
 24  addr_state                   466285 non-null  object
 25  dti                          466285 non-null  float64
 26  delinq_2yrs                  466256 non-null  float64
 27  earliest_cr_line             466256 non-null  object
 28  inq_last_6mths               466256 non-null  float64
 29  mths_since_last_delinq       215934 non-null  float64
 30  mths_since_last_record       62638 non-null   float64
 31  open_acc                     466256 non-null  float64
 32  pub_rec                      466256 non-null  float64
 33  revol_bal                    466285 non-null  int64
 34  revol_util                   465945 non-null  float64
 35  total_acc                    466256 non-null  float64
 36  initial_list_status          466285 non-null  object
 37  out_prncp                    466285 non-null  float64
 38  out_prncp_inv                466285 non-null  float64
 39  total_pymnt                  466285 non-null  float64
 40  total_pymnt_inv              466285 non-null  float64
 41  total_rec_prncp              466285 non-null  float64
 42  total_rec_int                466285 non-null  float64
 43  total_rec_late_fee           466285 non-null  float64
 44  recoveries                   466285 non-null  float64
 45  collection_recovery_fee      466285 non-null  float64
 46  last_pymnt_d                 465909 non-null  object
 47  last_pymnt_amnt              466285 non-null  float64
 48  next_pymnt_d                 239071 non-null  object
 49  last_credit_pull_d           466243 non-null  object
 50  collections_12_mths_ex_med   466140 non-null  float64
 51  mths_since_last_major_derog  98974 non-null   float64
 52  policy_code                  466285 non-null  int64
 53  application_type             466285 non-null  object
 54  annual_inc_joint             0 non-null       float64
 55  dti_joint                    0 non-null       float64
 56  verification_status_joint    0 non-null       float64
 57  acc_now_delinq               466256 non-null  float64
 58  tot_coll_amt                 396009 non-null  float64
 59  tot_cur_bal                  396009 non-null  float64
 60  open_acc_6m                  0 non-null       float64
 61  open_il_6m                   0 non-null       float64
 62  open_il_12m                  0 non-null       float64
 63  open_il_24m                  0 non-null       float64
 64  mths_since_rcnt_il           0 non-null       float64
 65  total_bal_il                 0 non-null       float64
 66  il_util                      0 non-null       float64
 67  open_rv_12m                  0 non-null       float64
 68  open_rv_24m                  0 non-null       float64
 69  max_bal_bc                   0 non-null       float64
 70  all_util                     0 non-null       float64
 71  total_rev_hi_lim             396009 non-null  float64
 72  inq_fi                       0 non-null       float64
 73  total_cu_tl                  0 non-null       float64
 74  inq_last_12m                 0 non-null       float64

The list below contains my questions and testable hypotheses:

  1. Higher grades (e.g., grade:A) significantly increase the likelihood of a positive outcome compared to the baseline grade.
  2. Homeowners with a mortgage are more likely to achieve a positive outcome compared to renters.
  3. Individuals with 10+ years of employment experience are more likely to achieve a positive outcome compared to those with less experience.
  4. Lower interest rates significantly increase the likelihood of a positive outcome compared to higher interest rates.
  5. Lower DTI values (e.g., 1.4-3.5) positively influence the likelihood of a positive outcome.
  6. Individuals with longer credit histories (e.g., more than 271 months) are more likely to achieve a positive outcome.

Employment Length for Issued Loans:

Longer employment tenure (10+ years) correlates with the highest number of loans, while very short and moderate lengths show notable but lower issuance.

Payment Plan – Loan Amount:

For target 0, the loan amounts with payment plans (y) have a broader distribution and higher density in the upper range compared to those without payment plans (n), which are more concentrated in the lower range.
For target 1, the distribution of loan amounts is narrower overall, with payment plans (y) showing higher density at lower loan amounts, while those without payment plans (n) have a more uniform spread.

Amount and Status of Loans

Fully Paid loans tend to have the largest range of loan amounts, with some outliers reaching very high values (around 35,000).
Charged Off loans have a narrower range compared to Fully Paid, with most loans clustered around mid-to-high amounts.
Loans in the Current and Default statuses exhibit relatively uniform ranges, with medians located slightly lower than Fully Paid loans.
Late payment statuses (16–30 days and 31–120 days) and In Grace Period loans show similar ranges but lower overall loan amounts compared to Fully Paid or Charged Off categories.
Loans labeled as Does not meet the credit policy are split into two groups (Fully Paid and Charged Off), with their distributions closely resembling their respective main categories.

Frequency of Grade of Loans

Most of the loans are graded B, C and D.
Grades E, F, and G have significantly lower counts, indicating they are less common in the dataset. This suggests a skew towards higher loan volume in the mid-tier grades (B, C, D).

Interest Rate Distribution

For target = 0, the interest rates have a higher density in the range of approximately 12% to 15%, showing a sharper peak and a more concentrated distribution.
For target = 1, the interest rates are more evenly distributed, with a broader density peak between 15% and 20%, suggesting higher variability compared to target = 0.

Interest Rate and Grade of Loans

For each grade, the distributions for target = 0 and target = 1 are generally similar, with slight differences in density and spread.
Grades A and B have the narrowest interest rate ranges (concentrated below 10%), while grades F and G exhibit the widest ranges and highest rates (extending beyond 25%).
The spread for higher grades (F and G) is larger, suggesting more variability in interest rates, with noticeable overlap between target = 0 and target = 1.

Number of Loans by States

In the majority of states, non-defaulted loans outweigh defaulted ones, with only a small number of states where defaulted loans are comparable to or slightly higher than non-defaulted loans. This indicates a general trend of non-defaulted loans dominating across most states.

Median loan amounts are fairly consistent nationwide. However, Alaska (AK) exhibits greater variability in loan amounts, as indicated by longer whiskers and more outliers. The majority of loan amounts fall within a similar range across states, but outliers—represented as dots outside the whiskers—highlight occasional unusually high or low loan amounts in some states.

Loan Amount vs. Employment Length by Default or Non-Default

Loan amounts are generally higher for target = 0 (non-defaulted loans) across all employment lengths, with the widest ranges and highest medians seen for borrowers with 10+ years of experience.
Defaulted loans (target = 1) exhibit greater variability in the mid-range employment lengths (5-9 years), while borrowers with less than 1 year of experience tend to have the smallest loan amounts.

Loan Amount in each Term by Default or Non-Default

For 36-month terms, loan amounts are generally smaller, with target = 0 (non-defaulted loans) having a slightly higher median and narrower range compared to target = 1 (defaulted loans).

For 60-month terms, loan amounts are larger overall, with both target = 0 and target = 1 showing similar medians but with greater variability and more outliers in the target = 1 group.

Loan Amount for each Purpose by Default or Non-Default

Purposes like “small business,” “house,” and “car” have higher median loan amounts and wider ranges, particularly for defaulted loans (target = 1), which often show greater variability and more outliers.
Lower loan amounts are associated with purposes such as “educational,” “vacation,” and “moving,” where the distributions for target = 0 (non-defaulted loans) are generally tighter and show smaller ranges compared to target = 1.

Loan Amount by Initial List Status and Target

The distributions of loan amounts are similar, with slightly higher medians for non-defaulted loans (target = 0) compared to defaulted loans (target = 1).
Both statuses exhibit comparable ranges and variability, with no significant differences between the two categories.

Loan Amount with Annual Income

Most borrowers have annual incomes below USD100,000, with loan amounts distributed across a wide range, indicating that lower-income borrowers frequently take loans of varying sizes.
There is a noticeable density of data points at the lower end of the annual income axis, suggesting a high concentration of lower-income borrowers.

The majority of borrowers have a loan-to-income ratio between 0.1 and 0.3, making up 10% to 30% of their annual income.
The density curve overlaid on the histogram highlights a peak around 0.2.
Very few borrowers have a ratio above 0.4.


Continuous variables Preprocessing

Emp_length: Employment length in years

loan_data['emp_length'].unique()
array(['10+ years', '< 1 year', '1 year', '3 years', '8 years', '9 years',
       '4 years', '5 years', '6 years', '2 years', '7 years', nan],
      dtype=object)

We need to remove the text 'year', 'years', '+ years' and '< 1 year' so that only the number of years remains, using str.replace.
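
A possible implementation, assuming the cleaned values are stored in a new column emp_length_int (the name used later in the missing-value section) and that pandas is imported as pd as elsewhere in the notebook:

# Strip the text parts of emp_length and keep only the number of years
loan_data['emp_length_int'] = loan_data['emp_length'].str.replace(r'\+ years', '', regex=True)
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('< 1 year', '0', regex=False)
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' years', '', regex=False)
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' year', '', regex=False)
loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int'])  # NaNs are filled later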

Earliest_cr_line: The month the borrower’s earliest reported credit line was opened

loan_data['earliest_cr_line']

We need to convert the string dates to datetime values with pd.to_datetime(), and then calculate the duration between the earliest_cr_line date and the date I run the model.
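
One way to do this, assuming the result is stored in mths_since_earliest_cr_line (the column name used later) and using an arbitrary reference date in place of "today":

# Parse the month-year strings (e.g. 'Jan-85') into datetime values
loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], format='%b-%y')
# Months between an assumed reference date and the earliest credit line
ref_date = pd.to_datetime('2017-12-01')  # assumed reference date; adjust as needed
loan_data['mths_since_earliest_cr_line'] = ((ref_date.year - loan_data['earliest_cr_line_date'].dt.year) * 12
                                            + (ref_date.month - loan_data['earliest_cr_line_date'].dt.month))
# Note: two-digit years before 1969 are parsed into the future by '%b-%y' and may need correcting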

Term: The number of payments on the loan

loan_data['term']

Remove the word ‘months’ to convert data to numeric values by pd.to_numeric(loan_data['term'].str.replace(' months', ''))

Issue_d: The month which the loan was funded

loan_data['issue_d']

Convert it to numeric dates by pd.to_datetime(loan_data['issue_d'], format = '%b-%y') and calculate the duration between it and the day I run the model.
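
A sketch of the same idea, assuming the result goes into a column named mths_since_issue_d (the name that appears later in the scorecard example) and using the same arbitrary reference date:

loan_data['issue_d_date'] = pd.to_datetime(loan_data['issue_d'], format = '%b-%y')
ref_date = pd.to_datetime('2017-12-01')  # assumed reference date
loan_data['mths_since_issue_d'] = ((ref_date.year - loan_data['issue_d_date'].dt.year) * 12
                                   + (ref_date.month - loan_data['issue_d_date'].dt.month))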

Discrete variables preprocessing

Grade: LC assigned loan grade

loan_data['grade'].value_counts()

Create dummy variables by pd.get_dummies(loan_data['grade'], prefix = 'grade', prefix_sep = ':').

Sub_grade: LC assigned loan subgrade

loan_data['sub_grade'].value_counts()

Creating dummy variables by pd.get_dummies(loan_data['sub_grade'], prefix = 'sub_grade', prefix_sep = ':').

Home_ownership: The home ownership status provided by the borrower during registration

loan_data['home_ownership'].value_counts()

Create dummy variables by pd.get_dummies(loan_data['home_ownership'], prefix = 'home_ownership', prefix_sep = ':').

Verification_Status: Indicates if the borrower’s income was verified by LC, not verified, or if the income source was verified

loan_data['verification_status'].value_counts()

Create dummy variables by pd.get_dummies(loan_data['verification_status'], prefix = 'verification_status', prefix_sep = ':').

Loan_Status: Current status of the loan

loan_data['loan_status'].value_counts()

Create dummy variables by pd.get_dummies(loan_data['loan_status'], prefix = 'loan_status', prefix_sep = ':').

Purpose: A category provided by the borrower for the loan request

loan_data['purpose'].value_counts()

Create dummy variables by pd.get_dummies(loan_data['purpose'], prefix = 'purpose', prefix_sep = ':').

Addr_State: The state provided by the borrower in the loan application

loan_data['addr_state'].value_counts()
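
Create dummy variables by pd.get_dummies(loan_data['addr_state'], prefix = 'addr_state', prefix_sep = ':') (this dummy set appears in the concatenation below).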

Initial_list_status: The initial listing status of the loan

loan_data['initial_list_status'].value_counts()

Create dummy variables by pd.get_dummies(loan_data['initial_list_status'], prefix = 'initial_list_status', prefix_sep = ':').

Finally, concatenate dummy variables into the dataframe:

# Concatenating dummy variables into a DataFrame
loan_data_dummies = pd.concat([pd.get_dummies(loan_data['grade'], prefix='grade', prefix_sep=':'),
                      pd.get_dummies(loan_data['sub_grade'], prefix='sub_grade', prefix_sep=':'),
                      pd.get_dummies(loan_data['home_ownership'], prefix='home_ownership', prefix_sep=':'),
                      pd.get_dummies(loan_data['verification_status'], prefix='verification_status', prefix_sep=':'),
                      pd.get_dummies(loan_data['loan_status'], prefix='loan_status', prefix_sep=':'),
                      pd.get_dummies(loan_data['purpose'], prefix='purpose', prefix_sep=':'),
                      pd.get_dummies(loan_data['addr_state'], prefix='addr_state', prefix_sep=':'),
                      pd.get_dummies(loan_data['initial_list_status'], prefix='initial_list_status', prefix_sep=':')],
                     axis=1)

Missing values checking and cleaning

total_rev_hi_lim: Total revolving high credit/credit limit

I fill the missing values with the values of funded_amnt column.

loan_data['total_rev_hi_lim'].fillna(loan_data['funded_amnt'], inplace=True)

annual_inc: The self-reported annual income provided by the borrower during registration

I fill the missing values with the mean of that column.

loan_data['annual_inc'].fillna(loan_data['annual_inc'].mean(), inplace=True)

mths_since_earliest_cr_line

I fill the missing values with the value of 0.

loan_data['mths_since_earliest_cr_line'].fillna(0, inplace=True)

acc_now_delinq: The number of accounts on which the borrower is now delinquent

I fill the missing values with the value of 0.

loan_data['acc_now_delinq'].fillna(0, inplace=True)

total_acc: The total number of credit lines currently in the borrower’s credit file

I fill the missing values with the value of 0.

loan_data['total_acc'].fillna(0, inplace=True)

pub_rec: Number of derogatory public records

I fill the missing values with the value of 0.

loan_data['pub_rec'].fillna(0, inplace=True)

open_acc: The number of open credit lines in the borrower’s credit file

I fill the missing values with the value of 0.

loan_data['open_acc'].fillna(0, inplace=True)

inq_last_6mths:The number of inquiries in past 6 months (excluding auto and mortgage inquiries)

I fill the missing values with the value of 0.

loan_data['inq_last_6mths'].fillna(0, inplace=True)

delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years

I fill the missing values with the value of 0.

loan_data['delinq_2yrs'].fillna(0, inplace=True)

emp_length_int: Employment length in years

I fill the missing values with the value of 0.

loan_data['emp_length_int'].fillna(0, inplace=True)

Model

PD Model

Data preparation

Dependent Variable: Good/ Bad (Default) Definition – Default and Non-default Accounts

loan_data['loan_status'].value_counts()

Calculate the proportion of each loan_status over the total number of observations:

loan_data['loan_status'].value_counts() / loan_data['loan_status'].count()

Next, I define good/bad borrowers: borrowers whose loan status is ‘Charged Off’, ‘Default’, ‘Does not meet the credit policy. Status: Charged Off’, or ‘Late (31-120 days)’ are labeled 0 (bad), otherwise 1 (good).

loan_data['good_bad'] = np.where(loan_data['loan_status'].isin([
'Charged Off', 
'Default', 
'Does not meet the credit policy. Status:Charged Off', 
'Late (31-120 days)']), 0, 1)

I count the number of good and bad borrowers:

loan_data['good_bad'].value_counts()

Splitting the dataset into four dataframes: Inputs – Train, Inputs – Test, Targets – Train, Targets – Test, after removing the ‘good_bad’ dependent variable from the inputs. 80% of the dataset will be used for training, 20% for testing.

loan_data_inputs_train, loan_data_inputs_test, loan_data_targets_train, loan_data_targets_test = train_test_split(loan_data.drop('good_bad', axis = 1), loan_data['good_bad'], test_size = 0.2, random_state = 42)

I use the code below to assign loan_data_inputs_train and loan_data_targets_train to two new dataframes, df_inputs_prepr and df_targets_prepr, for further data processing before the PD model computation.

df_inputs_prepr = loan_data_inputs_train
df_targets_prepr = loan_data_targets_train

After running the code df_inputs_prepr['grade'].unique(), I get the result:
array(['C', 'E', 'A', 'D', 'B', 'F', 'G'], dtype=object),
then I concatenate it with the dataframe df_targets_prepr by the below code:

df_independent_vars = pd.concat([df_inputs_prepr['grade'], df_targets_prepr], axis = 1) # Concatenating two dataframes along the columns.
df_independent_vars.head()

I group the data by counting the observations of each grade:

df_independent_vars.groupby(
df_independent_vars.columns.values[0], as_index = False)[df_independent_vars.columns.values[1]].count()

Weight of Evidence

The Weight of Evidence (WoE) indicates the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers.
“Bad customers” refers to the customers who defaulted on a loan, and
“Good customers” refers to the customers who paid back their loan.

WoE = ln(Distribution of Goods / Distribution of Bads)

where:
Distribution of Goods – % of good customers in a particular group,
Distribution of Bads – % of bad customers in a particular group,
ln – natural log.
Positive WoE means Distribution of Goods > Distribution of Bads;
negative WoE means Distribution of Goods < Distribution of Bads.

df_independent_vars['WoE'] = np.log(df_independent_vars['proportion_numbers_of_good'] / df_independent_vars['proportion_numbers_of_bad'])
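
The two proportion columns used in the line above are not constructed in the snippets shown so far; a minimal sketch of how they could be derived from the grouped data (df_temp and the intermediate column names are mine):

col = df_independent_vars.columns.values[0]     # 'grade'
target = df_independent_vars.columns.values[1]  # 'good_bad'

# Number of observations and share of good borrowers per grade
df_temp = df_independent_vars.groupby(col)[target].agg(['count', 'mean']).reset_index()
df_temp.columns = [col, 'n_obs', 'prop_good']

# Absolute numbers of good and bad borrowers per grade
df_temp['n_good'] = df_temp['prop_good'] * df_temp['n_obs']
df_temp['n_bad'] = (1 - df_temp['prop_good']) * df_temp['n_obs']

# Share of all good (respectively all bad) borrowers that falls into each grade
df_temp['proportion_numbers_of_good'] = df_temp['n_good'] / df_temp['n_good'].sum()
df_temp['proportion_numbers_of_bad'] = df_temp['n_bad'] / df_temp['n_bad'].sum()

# Weight of Evidence per grade
df_temp['WoE'] = np.log(df_temp['proportion_numbers_of_good'] / df_temp['proportion_numbers_of_bad'])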

Information Value

IV (Information Value) and WOE are closely related. IV is a data exploration technique that helps determine which variable in a data set has predictive power or influence on the value of a specified dependent variable (0 or 1).

IV is a numerical value that quantifies the overall predictive strength of an independent variable X in capturing the binary dependent variable Y. It is defined mathematically as the sum, across all groups, of the difference between the distribution of goods and the distribution of bads, multiplied by the WoE of that group: IV = Σ (Distribution of Goods – Distribution of Bads) × WoE.

https://www.linkedin.com/pulse/understanding-weight-evidence-information-value-elshaddai-harris/

IV is helpful for reducing the number of variables used for building a Logistic Regression model, especially when there are many potential variables. IV analyzes each individual independent variable in turn without considering other predictor variables. Based on the IV values of the variable, I use the below logic to understand its predictive power:
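
The rule-of-thumb scale referred to above is not reproduced in the text; one widely used convention, together with a minimal IV calculation that continues the hypothetical df_temp frame from the WoE sketch above:

# Information Value: sum over groups of (share of goods - share of bads) * WoE
IV = ((df_temp['proportion_numbers_of_good'] - df_temp['proportion_numbers_of_bad']) * df_temp['WoE']).sum()

# A common interpretation scale:
#   IV < 0.02        -> not useful for prediction
#   0.02 <= IV < 0.1 -> weak predictive power
#   0.1  <= IV < 0.3 -> medium predictive power
#   0.3  <= IV < 0.5 -> strong predictive power
#   IV >= 0.5        -> suspiciously high (check for leakage)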

Logistic Regression

I estimate the coefficients using an object of the ‘LogisticRegression’ class, with the inputs (independent variables) contained in the first dataframe and the targets (dependent variable) contained in the second dataframe.

from sklearn.linear_model import LogisticRegression

reg = LogisticRegression()
reg.fit(inputs_train, loan_data_targets_train)

I also calculate p-values and add to the summary table:
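
scikit-learn’s LogisticRegression does not report p-values directly. The notebook’s wrapper is not shown here; as an illustration, one common way to add Wald-test p-values looks roughly like this (my own sketch, not necessarily identical to the notebook’s implementation; with sklearn’s default L2 regularization these p-values are only approximate):

import numpy as np
import scipy.stats as stat
from sklearn.linear_model import LogisticRegression

class LogisticRegression_with_p_values:
    # Wraps sklearn's LogisticRegression and adds two-sided Wald-test p-values.
    def __init__(self, *args, **kwargs):
        self.model = LogisticRegression(*args, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, y)
        X_arr = np.asarray(X, dtype=float)
        X_design = np.hstack([np.ones((X_arr.shape[0], 1)), X_arr])  # add intercept column
        p = self.model.predict_proba(X)[:, 1]                        # fitted probabilities
        w = p * (1 - p)                                              # logistic variance weights
        fisher = X_design.T @ (X_design * w[:, None])                # X' W X without a huge diagonal matrix
        cov = np.linalg.inv(fisher)                                  # approximate covariance of the estimates
        se = np.sqrt(np.diag(cov))
        coefs = np.concatenate([self.model.intercept_, self.model.coef_[0]])
        z = coefs / se                                               # Wald z-scores
        self.p_values = 2 * (1 - stat.norm.cdf(np.abs(z)))           # intercept first, then one per feature
        return self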

PD Model Validation

ROC & AUROC

A receiver operating characteristic (ROC) curve is a graph that shows how well a binary classifier model performs at different threshold values. It plots the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting.

The ROC curve is a useful tool for comparing the performance of different classifiers and determining if a classifier is better than random guessing. The position of the ROC curve on the graph reflects the accuracy of the diagnostic test. 

The area under the ROC curve (AUC) is a common performance metric for an ROC curve. The value of the AUC should be as large as possible within the range of zero to one.

Calculate the Area Under the Receiver Operating Characteristic Curve (AUROC) from a set of actual values and their predicted probabilities:

A common scale for the area under the ROC curve.
The model is Bad if its area under the curve is between 50% and 60%.
If it is between 60% and 70%, the model is fair.
If it is between 70% and 80%, the model is good.
If it is between 80% and 90%, the model is excellent.

from sklearn.metrics import roc_auc_score

AUROC = roc_auc_score(df_actual_predicted_probs['loan_data_targets_test'], df_actual_predicted_probs['y_hat_test_proba'])
print(f"The area under the curve is: {AUROC:.2%}")

The area under the curve is: 70.18% → the model is good.

Gini and Kolmogorov-Smirnov

Gini was created as a measure of income inequality, that is, to measure the inequality between rich and poor individuals. In credit risk modeling, Gini is used with the same purpose: to measure the inequality between non-defaulted (good) borrowers and defaulted (bad) borrowers in a population.

The Gini coefficient is measured by plotting the cumulative proportion of defaulted or bad borrowers as a function of the cumulative proportion of all borrowers.

The Gini coefficient is the percentage of the area above the secondary diagonal line enclosed between this concave curve and the secondary diagonal line. The greater the area, the better the model.

Gini = AUROC * 2 - 1
0.40358843783278964

Kolmogorov-Smirnov shows to what extent the model separates the actual good borrowers from the actual bad borrowers. It is measured by looking at the cumulative distributions of actual good and actual bad borrowers with respect to the probabilities of being good or bad estimated by our model.
Kolmogorov-Smirnov, or K-S in short, is the maximum difference between the cumulative distribution functions of good and bad borrowers with respect to the predicted probabilities. The greater this difference, the better the model.

Calculate KS from the data. It is the maximum of the difference between the cumulative percentage of ‘bad’ and the cumulative percentage of ‘good’.

KS = max(df_actual_predicted_probs['Cumulative Perc Bad'] - df_actual_predicted_probs['Cumulative Perc Good'])
0.2968539589918978
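
The two cumulative-percentage columns used above can be built from the sorted predictions; a minimal sketch, assuming df_actual_predicted_probs holds the actual targets and predicted probabilities under the same column names as in the AUROC calculation:

# Sort by predicted probability of being 'good'
df_actual_predicted_probs = df_actual_predicted_probs.sort_values('y_hat_test_proba')
# Cumulative counts of good (target = 1) and bad (target = 0) borrowers
df_actual_predicted_probs['Cumulative N Good'] = df_actual_predicted_probs['loan_data_targets_test'].cumsum()
df_actual_predicted_probs['Cumulative N Bad'] = (1 - df_actual_predicted_probs['loan_data_targets_test']).cumsum()
# Convert the cumulative counts into cumulative percentages
df_actual_predicted_probs['Cumulative Perc Good'] = (df_actual_predicted_probs['Cumulative N Good']
                                                     / df_actual_predicted_probs['loan_data_targets_test'].sum())
df_actual_predicted_probs['Cumulative Perc Bad'] = (df_actual_predicted_probs['Cumulative N Bad']
                                                    / (1 - df_actual_predicted_probs['loan_data_targets_test']).sum())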

The Kolmogorov-Smirnov (KS) statistic of 0.2969 (approximately) represents the maximum separation between the cumulative distribution functions of the good and bad borrowers based on their estimated probability of being “good” as predicted by the model.

In practical terms, this KS value indicates how well the model discriminates between the two groups (good vs. bad borrowers). A higher KS value typically suggests a better model in terms of its ability to separate these groups; it means that the model’s predictions for good and bad borrowers are more distinct.
• Separation ability: a KS value of 0.2969 implies that there is roughly a 29.69% maximum difference between the cumulative probability distributions for good and bad borrowers. This separation shows how well the model can distinguish between them.
• Model quality: generally, in credit scoring and binary classification contexts, a KS statistic above 0.3 is considered reasonably good, while values above 0.4 indicate strong discriminatory power. Since 0.2969 is close to 0.3, it suggests the model has moderate discriminatory power, but there could still be room for improvement.
In the plot, this maximum difference corresponds to the largest vertical distance between the red (bad borrowers) and blue (good borrowers) lines. The larger this distance, the more effective the model is at correctly ranking borrowers by their likelihood of being good or bad.

Apply PD Model to calculate Credit Score

FICO credit scores are a method of quantifying and evaluating an individual’s creditworthiness. FICO scores are used in 90% of mortgage application decisions in the United States.

Created by the Fair Isaac Corporation, FICO scores consider data in five areas: payment history, the current level of indebtedness, types of credit used, length of credit history, and new credit accounts.  A FICO score ranges from 300 to 850 and is categorized into five ranges:

https://www.investopedia.com/terms/f/ficoscore.asp

Let’s try to calculate the credit score of observation 362514.
The intercept = 342,
Grade C (true) = 47,
Home ownership – MORTGAGE = 9,
address – CA = 6,
verification_status:Verified = -1,
purpose:major_purch__car__home_impr = 21,
mths_since_issue_d:40-41 = 72,
int_rate:12.025-15.74 = 31,
mths_since_earliest_cr_line:165-247 = 5,
inq_last_6mths:0 = 23,
annual_inc:60K-70K = 10,
dti:7.7-10.5 = 11,
mths_since_last_delinq:Missing = 7,
mths_since_last_record:Missing = 11.

Altogether, the total is 594, which is the credit score of observation 362514.
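
The same addition in Python, just to check the total (the point values are copied from the list above):

scorecard_points = [342, 47, 9, 6, -1, 21, 72, 31, 5, 23, 10, 11, 7, 11]
print(sum(scorecard_points))  # 594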

Credit Score to PD Model

Following the several steps to calculate the score in my notebook, I eventually calculate the number of rejected/approved applications and the corresponding rates.

Some conclusions:
With a cutoff level of 0.9, I end up with a 53.66% approval rate and about a 46.34% rejection rate.
With a probability of default of 5%, I have a 20.83% approval rate and a 79.16% rejection rate.

df_cutoffs.head()
df_cutoffs.tail()
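
A minimal sketch of how an approval rate could be computed for a given cutoff, assuming the df_actual_predicted_probs frame from the validation section (the threshold follows the conclusions above):

# Approve applicants whose predicted probability of being 'good' is at or above the cutoff
cutoff = 0.9  # cutoff on the probability of being good (equivalent to rejecting estimated PD above 10%)
n_total = df_actual_predicted_probs.shape[0]
n_approved = (df_actual_predicted_probs['y_hat_test_proba'] >= cutoff).sum()
approval_rate = n_approved / n_total
rejection_rate = 1 - approval_rate
print(approval_rate, rejection_rate)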

PD Model Monitoring

To monitor the PD model, I use a different dataset, named loan_data_2015.csv, and assign it to the variable ‘loan_data’.

loan_data.head()

Similar to the original dataset, I also do some preprocessing methods to cleanse the data before being used to validate the PD model.

In this section, I will use the Population Stability Index, aka PSI. PSI measures, in a single number, how much a population has shifted over time or between two different samples of a population. It does this by bucketing the two distributions and comparing the percentages of items in each of the buckets, resulting in a single number you can use to understand how different the populations are. The common interpretations of the PSI result are:
PSI < 0.1: no significant population change,
0.1 <= PSI < 0.2: moderate population change,
PSI >= 0.2: significant population change.

A formula to calculate PSI is: PSI = Σ (% Actual − % Expected) × ln(% Actual / % Expected), summed over the buckets of each variable. In my notebook, each bucket’s term is stored in a ‘Contribution’ column, which I then sum per original feature:

PSI_calc.groupby('Original feature name')['Contribution'].sum().sort_values(ascending=True)
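
A minimal sketch of how the PSI could be computed for a single categorical feature from two samples; the function and dataframe names here are mine, not the notebook’s:

import numpy as np

def psi_for_feature(expected, actual):
    # Share of observations falling into each category of the variable, for both samples
    expected_perc = expected.value_counts(normalize=True)
    actual_perc = actual.value_counts(normalize=True)
    # Align the two sets of categories; a tiny fill value avoids log(0) for empty buckets
    expected_perc, actual_perc = expected_perc.align(actual_perc, fill_value=1e-6)
    contribution = (actual_perc - expected_perc) * np.log(actual_perc / expected_perc)
    return contribution.sum()

# Example usage (dataframe names are assumptions):
# psi_grade = psi_for_feature(loan_data_2007_2014['grade'], loan_data_2015['grade'])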

Based on the PSI scale, here are my conclusions:
No Significant Population Change (PSI < 0.1): Features such as acc_now_delinq, addr_state, home_ownership, annual_inc, grade, and emp_length have a PSI contribution below 0.1, indicating no significant population change for these features. These features are stable, and there has been minimal change in the distribution of these variables between the compared datasets.
Moderate Population Change (PSI ≈ 0.1 – 0.2): Features like term, mths_since_earliest_cr_line, inq_last_6mths, and verification_status fall within the range close to 0.1 or slightly higher, suggesting moderate population change. These features show some degree of change, but it’s not substantial enough to indicate a drastic shift.
Significant Population Change (PSI ≥ 0.2): The features initial_list_status, Score, and mths_since_issue_d have contributions above 0.2, indicating a significant population change. In particular, mths_since_issue_d has the highest PSI contribution (2.388305), signaling a substantial shift in this variable’s distribution. This feature may need further investigation, as it suggests a significant change in the population structure related to the time since the loan was issued. Score and initial_list_status also indicate notable changes, which may affect the model’s predictive power if these shifts are not accounted for.


LGD and EAD Model

Loss given default (LGD) is the estimated amount of money a bank or other financial institution loses when a borrower defaults on a loan. LGD is depicted as a percentage of total exposure at the time of default or a single dollar value of potential loss. A financial institution’s total LGD is calculated after a review of all outstanding loans using cumulative losses and exposure.
Exposure at default is the total value of a loan that a bank is exposed to when a borrower defaults.  For example, if a borrower takes out a loan for $100,000 and two years later the amount left on the loan is $75,000, and the borrower defaults, the exposure at default is $75,000.

A common variation considers the exposure at risk and recovery rate. Exposure at default is an estimated value that predicts the amount of loss a bank may experience when a debtor defaults on a loan. The recovery rate is a risk-adjusted measure to right-size the default based on the likelihood of the outcome.
LGD (in dollars) = Exposure at Risk (EAD) * (1 – Recovery Rate)
where Recovery Rate = Recoveries / Funded Amount
(Source: https://www.investopedia.com/terms/l/lossgivendefault.asp)

loan_data_defaults['recovery_rate'] = loan_data_defaults['recoveries'] / loan_data_defaults['funded_amnt']

EAD is the predicted amount of loss a bank may be exposed to when a debtor defaults on a loan. Banks often calculate an EAD value for each loan and then use these figures to determine their overall default risk. There are two methods to determine exposure at default.
Regulators use the first approach, which is called foundation internal ratings-based (F-IRB). This approach to determining exposure at risk includes forward valuations and commitment detail, though it omits the value of any guarantees, collateral, or security.
The second method, called advanced internal ratings-based (A-IRB), is more flexible and is used by banking institutions. Banks must disclose their risk exposure. A bank will base this figure on data and internal analysis, such as borrower characteristics and product type. EAD, along with loss given default (LGD) and the probability of default (PD), is used to calculate the credit risk capital of financial institutions. (https://www.investopedia.com/terms/e/exposure_at_default.asp) In this notebook, I use the second method, and the formula I use to calculate EAD is:
EAD = Total Funded Amount x Credit Conversion Factor

To calculate the credit conversion factor, or CCF, in this case I use the variables [‘funded_amnt’] and [‘total_rec_prncp’]:

loan_data_defaults['CCF'] = (loan_data_defaults['funded_amnt'] - loan_data_defaults['total_rec_prncp']) / loan_data_defaults['funded_amnt']

I plot histograms of [‘recovery_rate’] and [‘CCF’] to view the frequency and density of the variables with plt.hist:

plt.hist(loan_data_defaults['CCF'], bins = 100)

The two variables are proportions. They are constrained between 0 and 1.

Methodologically speaking, the density of proportions is best described as a specific distribution called beta distribution. The regression model used to assess the impact of a set of independent variables on a variable with beta distribution is called beta regression.

Normally it is used to model outcomes that are strictly greater than zero and strictly lower than one. However, some alterations allow for modeling outcomes that are greater than or equal to zero and lower than or equal to one, which is typically the case with recovery rate and credit conversion factor.

Currently, there is no major Python library that supports a stable version of beta regression, so finding an alternative is crucial. One possible solution is to use the functionality of the other most popular statistical computing environment for data science, R. There are R libraries capable of estimating beta regression models, and I could write and run R code within Python using bridging packages such as rpy2. However, converting Python objects such as large dataframes to R objects and vice versa may require a lot of random access memory. For these reasons, instead of state-of-the-art beta regression models, applying simpler yet methodologically appropriate statistical models is still a good choice.

plt.hist(loan_data_defaults['recovery_rate'], bins = 50)

It can be seen that about half of the observations have a recovery rate of zero, while the rest have recovery rates greater than zero.

So for estimating LGD, it is plausible to have a two-stage approach. First, model whether the recovery rate is zero or greater than zero; then, only if it is greater than zero, model how much exactly it is. The first question is a binary question, similar to the probability of default, where I modeled whether a borrower had defaulted or not. Here it is whether the recovery rate is zero or not, so using the same model, logistic regression, is a good choice.

For this model, I need to create a new binary dependent variable, recovery_rate_0_1. It is zero when the recovery rate is zero and one otherwise.
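
A minimal way to create this indicator, using the column name recovery_rate_0_1 that appears in the train/test split below:

# 0 where nothing was recovered, 1 where any amount was recovered
loan_data_defaults['recovery_rate_0_1'] = np.where(loan_data_defaults['recovery_rate'] == 0, 0, 1)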

To summarize, to model LGD, building a logistic regression model to estimate whether the recovery rate is zero or is greater than zero. Then for accounts where recovery rate is greater than zero, using a linear regression model to estimate how much exactly it is.

LGD Model

Logistic Regression

LGD model stage 1 dataset: recovery rate equal to 0 or greater than 0. I split the data with the following code:

lgd_inputs_stage_1_train, lgd_inputs_stage_1_test, lgd_targets_stage_1_train, lgd_targets_stage_1_test = train_test_split(loan_data_defaults.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), loan_data_defaults['recovery_rate_0_1'], test_size = 0.2, random_state = 42)

After running the code LogisticRegression_with_p_values(), I get the summary table:

I will calculate the Receiver Operating Characteristic (ROC) Curve from a set of actual values and their predicted probabilities.

I get three arrays: the false positive rates, the true positive rates, and the thresholds.

fpr, tpr, thresholds = roc_curve(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'])

I plot the false positive rate along the x-axis and the true positive rate along the y-axis, thus plotting the ROC curve.

plt.plot(fpr, tpr)
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')

I also calculate the Area Under the Receiver Operating Characteristic Curve (AUROC) from a set of actual values and their predicted probabilities.

AUROC = roc_auc_score(df_actual_predicted_probs['lgd_targets_stage_1_test'], df_actual_predicted_probs['y_hat_test_proba_lgd_stage_1'])
0.6479684167438654

Linear Regression

I take only the rows where the original recovery rate variable is greater than zero, i.e. where the indicator variable I created is equal to 1.

lgd_stage_2_data = loan_data_defaults[loan_data_defaults['recovery_rate_0_1'] == 1]

The dataset for LGD model stage 2 is split with the same technique as that of stage 1:

lgd_inputs_stage_2_train, lgd_inputs_stage_2_test, lgd_targets_stage_2_train, lgd_targets_stage_2_test = train_test_split(lgd_stage_2_data.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), lgd_stage_2_data['recovery_rate'], test_size = 0.2, random_state = 42)

I run the stage 2 model with LinearRegression() and get the result:
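
A minimal sketch of the stage-2 fit, assuming the same feature selection as in stage 1 (the model name and the clipping step are my own choices, not necessarily the notebook’s):

from sklearn.linear_model import LinearRegression

reg_lgd_st_2 = LinearRegression()
reg_lgd_st_2.fit(lgd_inputs_stage_2_train, lgd_targets_stage_2_train)

# Predicted recovery rates for the held-out set, clipped to the [0, 1] range of a proportion
y_hat_test_lgd_stage_2 = reg_lgd_st_2.predict(lgd_inputs_stage_2_test).clip(0, 1)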

EAD Model

I prepare the data for running this model:

ead_inputs_train, ead_inputs_test, ead_targets_train, ead_targets_test = train_test_split(loan_data_defaults.drop(['good_bad', 'recovery_rate','recovery_rate_0_1', 'CCF'], axis = 1), loan_data_defaults['CCF'], test_size = 0.2, random_state = 42)

I run the EAD model by LinearRegression() and get the result:


Expected Loss

I calculate estimated EAD (Estimated EAD equals estimated CCF multiplied by funded amount):

loan_data_preprocessed['EAD'] = loan_data_preprocessed['CCF'] * loan_data_preprocessed_lgd_ead['funded_amnt']
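
The EL column shown in the final selection below can then be obtained from the three components, following EL = PD x LGD x EAD. A minimal sketch, assuming the PD, LGD, and EAD columns already hold the estimates from the respective models under those names:

# Expected loss per loan, from the three estimated components
loan_data_preprocessed_new['EL'] = loan_data_preprocessed_new['PD'] * loan_data_preprocessed_new['LGD'] * loan_data_preprocessed_new['EAD']
# Portfolio-level expected loss as a share of the total funded amount
print(loan_data_preprocessed_new['EL'].sum() / loan_data_preprocessed_new['funded_amnt'].sum())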

Finally, I get the result:

loan_data_preprocessed_new[['funded_amnt', 'PD', 'LGD', 'EAD', 'EL']]
