Loan Eligibility Analysis#

About the Dataset#

The following dataset is used to predict whether a prospective borrower qualifies for loan approval, based on location, financial, and other credit-related factors. It is suitable for classification with machine learning.

The dataset was obtained from the Kaggle repository at the following link:

https://www.kaggle.com/datasets/avineshprabhakaran/loan-eligibility-prediction

The dataset contains the following columns:

| Column | Description | Example |
|---|---|---|
| Customer_ID | Unique identifier for each loan applicant | 569 |
| Gender | Gender of the applicant | Male / Female |
| Married | Marital status of the applicant | Yes / No |
| Dependents | Number of dependents | 0, 1, 2, 3 |
| Education | Education level of the applicant | Graduate / Not Graduate |
| Self_Employed | Whether the applicant is self-employed | Yes / No |
| Applicant_Income | Applicant’s monthly income | 5000 |
| Coapplicant_Income | Coapplicant’s monthly income | 1500 |
| Loan_Amount | Loan amount requested (in thousands) | 128 |
| Loan_Amount_Term | Loan repayment term (in months) | 360 |
| Credit_History | Credit history meets lending criteria (1 = Yes, 0 = No) | 1 |
| Property_Area | Type of property area | Urban / Semiurban / Rural |
| Loan_Status | Loan approved or not (target variable) | Y / N |

Import the Libraries#

[1]:
import kagglehub
import os

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import matplotlib as mpl
import seaborn as sns

Download the Dataset#

[2]:
path = kagglehub.dataset_download("avineshprabhakaran/loan-eligibility-prediction")
filename = os.listdir(path)[0]

fullpath = os.path.join(path, filename)
fullpath
Downloading to /root/.cache/kagglehub/datasets/avineshprabhakaran/loan-eligibility-prediction/3.archive...
100%|██████████| 7.39k/7.39k [00:00<00:00, 13.1MB/s]
Extracting files...

[2]:
'/root/.cache/kagglehub/datasets/avineshprabhakaran/loan-eligibility-prediction/versions/3/Loan Eligibility Prediction.csv'
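Taking the first entry of os.listdir(path) works here because the download folder holds a single file. As a hedged sketch, the CSV could instead be selected by its extension, which fails loudly if the folder ever contains zero or several CSV files. The helper name find_single_csv is hypothetical and not part of kagglehub.

```python
import glob
import os


def find_single_csv(folder):
    """Return the path of the only CSV file in `folder` (hypothetical helper)."""
    matches = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one CSV in {folder}, found {len(matches)}"
        )
    return matches[0]
```

In the notebook this would be called as fullpath = find_single_csv(path).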

Set the default plot style#

[3]:
def DefaultPlotStyle():
  plt.rcParams['lines.linewidth'] = 2
  plt.rcParams['lines.linestyle'] = '-'
  plt.rcParams['figure.figsize'] = [12,10]
  plt.rcParams['font.size'] = 12

DefaultPlotStyle()

Reading the Dataset#

[4]:
df = pd.read_csv(fullpath)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Customer_ID         614 non-null    int64
 1   Gender              614 non-null    object
 2   Married             614 non-null    object
 3   Dependents          614 non-null    int64
 4   Education           614 non-null    object
 5   Self_Employed       614 non-null    object
 6   Applicant_Income    614 non-null    int64
 7   Coapplicant_Income  614 non-null    float64
 8   Loan_Amount         614 non-null    int64
 9   Loan_Amount_Term    614 non-null    int64
 10  Credit_History      614 non-null    int64
 11  Property_Area       614 non-null    object
 12  Loan_Status         614 non-null    object
dtypes: float64(1), int64(6), object(6)
memory usage: 62.5+ KB

Checking the Dataset#

[5]:
# Checking for missing values
df.isnull().sum().reset_index().T
[5]:
0 1 2 3 4 5 6 7 8 9 10 11 12
index Customer_ID Gender Married Dependents Education Self_Employed Applicant_Income Coapplicant_Income Loan_Amount Loan_Amount_Term Credit_History Property_Area Loan_Status
0 0 0 0 0 0 0 0 0 0 0 0 0 0

There are no missing values in this dataset.
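When a dataset does contain gaps, it can help to see the share of missing values next to the raw counts. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the loan data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Loan_Amount": [128.0, 66.0, np.nan, np.nan],
    "Credit_History": [1, 0, 1, 1],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)
```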

[6]:
# Checking duplicate data
any(df.duplicated())
[6]:
False
[7]:
# Checking duplicate customer id
df['Customer_ID'].value_counts().reset_index().sort_values(by='count', ascending=False).head()
[7]:
Customer_ID count
613 271 1
0 606 1
1 569 1
2 15 1
3 95 1

There are no duplicate records in this dataset.
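The same two checks can be expressed more directly. This is a sketch on made-up data in which one customer ID is deliberately repeated, so that the checks have something to report:

```python
import pandas as pd

# Toy data (made up); Customer_ID 15 appears twice on purpose
toy = pd.DataFrame({
    "Customer_ID": [569, 15, 95, 15],
    "Loan_Amount": [9, 17, 25, 30],
})

# True if any complete row is repeated
has_duplicate_rows = toy.duplicated().any()

# True only if every Customer_ID occurs exactly once
ids_are_unique = toy["Customer_ID"].is_unique

# The IDs that occur more than once
id_mask = toy["Customer_ID"].duplicated(keep=False)
repeated_ids = toy.loc[id_mask, "Customer_ID"].unique()
```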

Dataset Preview#

[8]:
df.head()
[8]:
Customer_ID Gender Married Dependents Education Self_Employed Applicant_Income Coapplicant_Income Loan_Amount Loan_Amount_Term Credit_History Property_Area Loan_Status
0 569 Female No 0 Graduate No 2378 0.0 9 360 1 Urban N
1 15 Male Yes 2 Graduate No 1299 1086.0 17 120 1 Urban Y
2 95 Male No 0 Not Graduate No 3620 0.0 25 120 1 Semiurban Y
3 134 Male Yes 0 Graduate Yes 3459 0.0 25 120 1 Semiurban Y
4 556 Male Yes 1 Graduate No 5468 1032.0 26 360 1 Semiurban Y

Parameter Analysis#

[9]:
# Creating object feature for Credit History
mapper = lambda x : "YES" if x == 1 else "NO"
df['Credit_History_Obj'] = df['Credit_History'].map(mapper)
df.head()
[9]:
Customer_ID Gender Married Dependents Education Self_Employed Applicant_Income Coapplicant_Income Loan_Amount Loan_Amount_Term Credit_History Property_Area Loan_Status Credit_History_Obj
0 569 Female No 0 Graduate No 2378 0.0 9 360 1 Urban N YES
1 15 Male Yes 2 Graduate No 1299 1086.0 17 120 1 Urban Y YES
2 95 Male No 0 Not Graduate No 3620 0.0 25 120 1 Semiurban Y YES
3 134 Male Yes 0 Graduate Yes 3459 0.0 25 120 1 Semiurban Y YES
4 556 Male Yes 1 Graduate No 5468 1032.0 26 360 1 Semiurban Y YES
[10]:
# Selecting only object features
object_list = df.select_dtypes(include=['object']).columns
object_list
[10]:
Index(['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area',
       'Loan_Status', 'Credit_History_Obj'],
      dtype='object')
[11]:
# Plotting object features as pie charts
nobject = len(object_list)
n_cols = 3
n_rows = -(-nobject // n_cols)  # ceiling division, avoids an empty extra row

fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols)
ax_flat = axes.flatten()

for i, ax in enumerate(ax_flat):
  if i < nobject:
    col = object_list[i]
    counts = df[col].value_counts()
    # pie() normalizes the counts; autopct prints each share as a percentage
    ax.pie(counts, labels=counts.index, autopct="%.1f%%")
    ax.set_title(col.replace("_", " "))
  else:
    # Hide unused subplot slots
    ax.axis('off')

plt.tight_layout()
_images/Loan_Eligibility_Analysis_20_0.png
[12]:
# Calculating the total income from the applicant income and co-applicant income
df['Total_Income'] = df['Applicant_Income'] + df['Coapplicant_Income']
df.head()
[12]:
Customer_ID Gender Married Dependents Education Self_Employed Applicant_Income Coapplicant_Income Loan_Amount Loan_Amount_Term Credit_History Property_Area Loan_Status Credit_History_Obj Total_Income
0 569 Female No 0 Graduate No 2378 0.0 9 360 1 Urban N YES 2378.0
1 15 Male Yes 2 Graduate No 1299 1086.0 17 120 1 Urban Y YES 2385.0
2 95 Male No 0 Not Graduate No 3620 0.0 25 120 1 Semiurban Y YES 3620.0
3 134 Male Yes 0 Graduate Yes 3459 0.0 25 120 1 Semiurban Y YES 3459.0
4 556 Male Yes 1 Graduate No 5468 1032.0 26 360 1 Semiurban Y YES 6500.0
[13]:
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=df, x='Gender', hue='Loan_Status', ax=ax)
for container in ax.containers:
  ax.bar_label(container, label_type='edge')
# ax.set_ylim([0,200])
ax.set_ylabel('Number of Applicants')
ax.yaxis.set_minor_locator(MultipleLocator(25))
plt.grid(which='both')
plt.legend(title='Loan Status')
ax.set_axisbelow(True)
_images/Loan_Eligibility_Analysis_22_0.png

By gender, the majority of approved loans went to male customers; fewer than 100 female customers obtained a loan.

[14]:
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=df, x='Education',hue='Loan_Status', ax=ax)
for container in ax.containers:
  ax.bar_label(container, label_type='edge')
# ax.set_ylim([0,200])
ax.set_ylabel('Number of Applicants')
ax.yaxis.set_minor_locator(MultipleLocator(25))
plt.grid(which='both')
plt.legend(title='Loan Status')
ax.set_axisbelow(True)
_images/Loan_Eligibility_Analysis_24_0.png

By education status, customers who graduated are more likely to receive a loan than customers who did not graduate.
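Count plots show absolute numbers, which are dominated by the size of each group. To compare how likely approval is within each group, a row-normalized cross-tabulation can be used. The sketch below runs on made-up data; the resulting rates are illustrative only and say nothing about the loan dataset.

```python
import pandas as pd

# Toy data (made up): 4 graduates, 2 non-graduates
toy = pd.DataFrame({
    "Education": ["Graduate"] * 4 + ["Not Graduate"] * 2,
    "Loan_Status": ["Y", "Y", "Y", "N", "Y", "N"],
})

# normalize="index" makes each row sum to 1, giving the approval rate per group
rates = pd.crosstab(toy["Education"], toy["Loan_Status"], normalize="index")
print(rates)
```

In the notebook, the same call on df['Education'] and df['Loan_Status'] (or any other categorical column) would give the approval rate per category.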

[15]:
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=df, x='Self_Employed', hue='Loan_Status', ax=ax)
for container in ax.containers:
  ax.bar_label(container, label_type='edge')
# ax.set_ylim([0,200])
ax.set_ylabel('Number of Applicants')
ax.yaxis.set_minor_locator(MultipleLocator(25))
plt.grid(which='both')
plt.legend(title='Loan Status')
ax.set_axisbelow(True)
_images/Loan_Eligibility_Analysis_26_0.png

By employment status, customers who are not self-employed are more likely to obtain a loan.

[16]:
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=df, x='Property_Area', hue='Loan_Status', ax=ax)
for container in ax.containers:
  ax.bar_label(container, label_type='edge')
# ax.set_ylim([0,200])
ax.set_ylabel('Number of Applicants')
ax.yaxis.set_minor_locator(MultipleLocator(25))
plt.grid(which='both')
ax.set_axisbelow(True)
_images/Loan_Eligibility_Analysis_28_0.png

By area of residence, most loan applications come from customers living in semiurban areas, followed by customers living in urban and rural areas.

[17]:
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=df, x='Dependents', hue='Loan_Status', ax=ax)
for container in ax.containers:
  ax.bar_label(container, label_type='edge')
# ax.set_ylim([0,200])
ax.set_ylabel('Number of Applicants')
ax.yaxis.set_minor_locator(MultipleLocator(25))
plt.grid(which='both')
ax.set_axisbelow(True)
_images/Loan_Eligibility_Analysis_30_0.png

Most loans applied for and approved come from customers with no dependents.

[18]:
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=df, x='Married', hue='Loan_Status', ax=ax)
for container in ax.containers:
  ax.bar_label(container, label_type='edge')
# ax.set_ylim([0,200])
ax.set_ylabel('Number of Applicants')
ax.yaxis.set_minor_locator(MultipleLocator(25))
plt.grid(which='both')
ax.set_axisbelow(True)
_images/Loan_Eligibility_Analysis_32_0.png

By marital status, married customers are more likely to receive a loan.

[20]:
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=df, x='Credit_History_Obj', hue='Loan_Status', ax=ax)
for container in ax.containers:
  ax.bar_label(container, label_type='edge')
# ax.set_ylim([0,200])
ax.set_ylabel('Number of Applicants')
ax.yaxis.set_minor_locator(MultipleLocator(25))
plt.grid(which='both')
ax.set_axisbelow(True)
_images/Loan_Eligibility_Analysis_34_0.png

By credit history, customers whose credit history meets the lending criteria are more likely to obtain a loan.
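The count-plot cells above differ only in the column they plot. As a hedged sketch, the shared steps could be collected in one function; plot_status_counts is a hypothetical helper name, and the toy data below is made up purely to exercise it.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.ticker import MultipleLocator


def plot_status_counts(data, column, hue="Loan_Status"):
    """Draw one labelled count plot (hypothetical helper mirroring the cells above)."""
    fig, ax = plt.subplots(figsize=(12, 6))
    sns.countplot(data=data, x=column, hue=hue, ax=ax)
    # Print the count on top of every bar
    for container in ax.containers:
        ax.bar_label(container, label_type="edge")
    ax.set_ylabel("Number of Applicants")
    ax.yaxis.set_minor_locator(MultipleLocator(25))
    ax.grid(which="both")
    legend = ax.get_legend()
    if legend is not None:
        legend.set_title("Loan Status")
    ax.set_axisbelow(True)
    return ax


# Toy data (made up) just to exercise the helper
toy = pd.DataFrame({
    "Married": ["Yes", "Yes", "No", "No", "Yes"],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y"],
})
ax = plot_status_counts(toy, "Married")
```

With such a helper, each analysis cell reduces to a single call such as plot_status_counts(df, 'Gender').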

[21]:
# Selecting only numeric data
numeric_list = df.select_dtypes(exclude=['object']).columns
numeric_list = numeric_list.delete(0)
numeric_list
[21]:
Index(['Dependents', 'Applicant_Income', 'Coapplicant_Income', 'Loan_Amount',
       'Loan_Amount_Term', 'Credit_History', 'Total_Income'],
      dtype='object')
[22]:
df_numeric = df.select_dtypes(exclude=['object'])
df_numeric.head()
[22]:
Customer_ID Dependents Applicant_Income Coapplicant_Income Loan_Amount Loan_Amount_Term Credit_History Total_Income
0 569 0 2378 0.0 9 360 1 2378.0
1 15 2 1299 1086.0 17 120 1 2385.0
2 95 0 3620 0.0 25 120 1 3620.0
3 134 0 3459 0.0 25 120 1 3459.0
4 556 1 5468 1032.0 26 360 1 6500.0
[23]:
# Drop the identifier column; reassigning avoids in-place changes on a derived frame
df_numeric = df_numeric.drop(columns='Customer_ID')
df_numeric.head()
[23]:
Dependents Applicant_Income Coapplicant_Income Loan_Amount Loan_Amount_Term Credit_History Total_Income
0 0 2378 0.0 9 360 1 2378.0
1 2 1299 1086.0 17 120 1 2385.0
2 0 3620 0.0 25 120 1 3620.0
3 0 3459 0.0 25 120 1 3459.0
4 1 5468 1032.0 26 360 1 6500.0
[24]:
df.describe()
[24]:
Customer_ID Dependents Applicant_Income Coapplicant_Income Loan_Amount Loan_Amount_Term Credit_History Total_Income
count 614.000000 614.000000 614.000000 614.000000 614.000000 614.000000 614.000000 614.000000
mean 307.500000 0.856678 5403.459283 1621.245798 142.022801 338.892508 0.850163 7024.705081
std 177.390811 1.216651 6109.041673 2926.248369 87.083089 69.716355 0.357203 6458.663872
min 1.000000 0.000000 150.000000 0.000000 9.000000 12.000000 0.000000 1442.000000
25% 154.250000 0.000000 2877.500000 0.000000 98.000000 360.000000 1.000000 4166.000000
50% 307.500000 0.000000 3812.500000 1188.500000 125.000000 360.000000 1.000000 5416.500000
75% 460.750000 2.000000 5795.000000 2297.250000 164.750000 360.000000 1.000000 7521.750000
max 614.000000 4.000000 81000.000000 41667.000000 700.000000 480.000000 1.000000 81000.000000
[25]:
number_Coapplicant_Income_zero = df[df['Coapplicant_Income'] == 0].shape[0]
number_Coapplicant_Income_nonzero = df.shape[0] - number_Coapplicant_Income_zero

percentage_Coapplicant_Income_zero = (number_Coapplicant_Income_zero/df.shape[0])*100
percentage_Coapplicant_Income_nonzero = (number_Coapplicant_Income_nonzero/df.shape[0])*100

fig, ax = plt.subplots(figsize=(5,5))
ax.pie(x= [percentage_Coapplicant_Income_zero, percentage_Coapplicant_Income_nonzero],
       labels=['Do not have income', 'Have income'],
       autopct="%.1f%%");
ax.set_title('Coapplicant Income')
[25]:
Text(0.5, 1.0, 'Coapplicant Income')
_images/Loan_Eligibility_Analysis_40_1.png
[26]:
# Encoding the object data to numeric
df_object = df.select_dtypes(include=['object'])
df_object.head()
[26]:
Gender Married Education Self_Employed Property_Area Loan_Status Credit_History_Obj
0 Female No Graduate No Urban N YES
1 Male Yes Graduate No Urban Y YES
2 Male No Not Graduate No Semiurban Y YES
3 Male Yes Graduate Yes Semiurban Y YES
4 Male Yes Graduate No Semiurban Y YES
[27]:
gender_encoder = lambda x : 1 if x == 'Male' else 0
married_encoder = lambda x : 1 if x == 'Yes' else 0
education_encoder = lambda x : 1 if x == 'Graduate' else 0
self_employed_encoder = lambda x : 1 if x == 'Yes' else 0
property_area_encoder = lambda x : 1 if x == 'Urban' else 2 if x == 'Semiurban' else 3
loan_status_encoder = lambda x : 1 if x == 'Y' else 0

mappers = [gender_encoder, married_encoder, education_encoder, self_employed_encoder, property_area_encoder, loan_status_encoder]

df_object_names = df_object.columns.to_list()
df_object_names.remove('Credit_History_Obj')
df_object_names
[27]:
['Gender',
 'Married',
 'Education',
 'Self_Employed',
 'Property_Area',
 'Loan_Status']
[28]:
df_duplicate = df.copy(deep=True)

for i, obj in enumerate(df_object_names):
  df_duplicate[obj] = df_duplicate[obj].map(mappers[i])

df_duplicate.drop(['Customer_ID','Credit_History_Obj'], axis=1, inplace=True)


[29]:
df_duplicate.head()
[29]:
Gender Married Dependents Education Self_Employed Applicant_Income Coapplicant_Income Loan_Amount Loan_Amount_Term Credit_History Property_Area Loan_Status Total_Income
0 0 0 0 1 0 2378 0.0 9 360 1 1 0 2378.0
1 1 1 2 1 0 1299 1086.0 17 120 1 1 1 2385.0
2 1 0 0 0 0 3620 0.0 25 120 1 2 1 3620.0
3 1 1 0 1 1 3459 0.0 25 120 1 2 1 3459.0
4 1 1 1 1 0 5468 1032.0 26 360 1 2 1 6500.0
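The lambda encoders above send every unexpected value to their fallback branch, so a misspelled area would silently become 3. A minimal sketch of the same encoding with explicit dictionaries is shown below for three of the columns. The codes are the ones used above; the toy frame, including its deliberately misspelled value, is made up.

```python
import pandas as pd

# Explicit value-to-code tables; the codes match the lambda encoders above
encodings = {
    "Gender": {"Male": 1, "Female": 0},
    "Property_Area": {"Urban": 1, "Semiurban": 2, "Rural": 3},
    "Loan_Status": {"Y": 1, "N": 0},
}

# Toy data (made up); the last Property_Area value is a deliberate typo
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "semi-urban"],
    "Loan_Status": ["Y", "N", "Y"],
})

encoded = toy.copy()
for column, table in encodings.items():
    encoded[column] = toy[column].map(table)

# Values missing from a table become NaN, so they can be detected and fixed
unmapped = encoded.isnull().sum()
print(unmapped)
```

Note that coding Property_Area as 1, 2, 3 imposes an order on the areas; whether that order is meaningful for the model is a separate modelling choice.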

Applying Logistic Regression#

[30]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler
import numpy as np
[31]:
target = df_duplicate['Loan_Status']
features = df_duplicate.drop('Loan_Status', axis=1)

# Splitting the dataset, the test dataset is 20% of the total dataset
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Standardizing the features, because their scales differ widely
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Applying the logistic regression
linreg_model = LogisticRegression()
linreg_model.fit(X_train_scaled,y_train)

# Predicting the test dataset (scaled)
y_pred = linreg_model.predict(X_test_scaled)

# Evaluating the accuracy of the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
Accuracy: 0.8211382113821138
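A single train/test split gives an accuracy that depends on which rows land in the test set. As a hedged sketch, the scaler and the classifier can be wrapped in a scikit-learn pipeline and scored with cross-validation, so that scaling is always fitted on training folds only. The data here is synthetic (make_classification), an assumption made so the example runs on its own; it is not the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the loan dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# 5-fold cross-validation gives a less split-dependent accuracy estimate
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"test accuracy: {test_accuracy:.3f}, CV mean: {cv_scores.mean():.3f}")
```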
[32]:
# Providing the confusion matrix

# true negatives is C_{0,0},
# false positives is C_{0,1},
# false negatives is C_{1,0},
# true positives is C_{1,1}

custom_x_labels = ['Negative', 'Positive']
custom_y_labels = ['Negative', 'Positive']

fig, ax = plt.subplots(figsize=(6,6))
sns.heatmap(confusion_matrix(y_test, y_pred),
            xticklabels=custom_x_labels,
            yticklabels=custom_y_labels,
            annot=True,
            ax=ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
[32]:
Text(41.722222222222214, 0.5, 'Actual')
_images/Loan_Eligibility_Analysis_48_1.png
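The four cells of the confusion matrix are enough to recompute the headline metrics by hand. A small sketch with made-up labels (not the model's predictions) follows the C_{i,j} layout described in the comments above:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 3 actual negatives, 5 actual positives
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 1, 1, 1, 1, 1, 1, 0]

# ravel() flattens the 2x2 matrix row by row: C00, C01, C10, C11
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted positives that are right
recall = tp / (tp + fn)                     # share of actual positives that are found
```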
[33]:
# Generating the classification report
print(classification_report(y_pred, y_test, target_names=['Loan Status-No', 'Loan Status-Yes']));
                 precision    recall  f1-score   support

 Loan Status-No       0.47      0.95      0.63        20
Loan Status-Yes       0.99      0.80      0.88       103

       accuracy                           0.82       123
      macro avg       0.73      0.87      0.76       123
   weighted avg       0.90      0.82      0.84       123

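The model's accuracy in predicting the target (loan status) is about 82%. Note that the report above was generated with classification_report(y_pred, y_test), whereas scikit-learn expects the order (y_true, y_pred), so its precision and recall columns are interchanged and its support column counts predictions rather than actual labels. Read with that in mind, the model finds approved loans with a recall of about 99% at a precision of about 80%, while for rejected loans the recall is only about 47% at a precision of about 95%. In other words, the model rarely misses a loan that should be approved, but it also approves just over half of the applications that were actually rejected.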
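scikit-learn's metric functions take the true labels first and the predictions second. A small sketch with made-up labels shows that reversing the two arguments interchanges precision and recall:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 3 actual negatives, 5 actual positives
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 1, 1, 1, 1, 1, 1, 0]

# Correct order: (y_true, y_pred)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)

# Reversed order: the two metrics trade places
precision_swapped = precision_score(y_hat, y_true)
recall_swapped = recall_score(y_hat, y_true)

print(f"correct:  precision={precision:.2f}, recall={recall:.2f}")
print(f"reversed: precision={precision_swapped:.2f}, recall={recall_swapped:.2f}")
```

Accuracy and the F1 score are symmetric in the two arguments, so they are unaffected by the order.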

[34]:
# Plotting the ROC curve
y_pred_proba = linreg_model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test,  y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)

fig, ax = plt.subplots(figsize=(6,6))
ax.plot(fpr,tpr,label=f"data 1, auc={auc:.2f}")
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc=4)
plt.show()
_images/Loan_Eligibility_Analysis_51_0.png
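Each point on a ROC curve is the false positive rate and true positive rate obtained with one decision threshold on the predicted probability. A minimal sketch with four made-up scores makes the individual points visible:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)  # 0.75 for this toy data

# Each (FPR, TPR) pair is the outcome of classifying with one threshold
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```

The AUC of 0.75 means that a randomly chosen positive example receives a higher score than a randomly chosen negative one in 3 of the 4 possible pairs.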