Data source import complete.

Graphing Functions:  plot_distributions, plot_categorical_features, three_way_cat_cont_interactions, corr_plot_func

Three-way Association Regression Functions:  log_lin_3_way_cat_association, pairwise_loglin_homogenous_posthoc, pairwise_loglin_simple_heterogenous_posthoc, stratified_categorical_logistic_regression

Three-way Association Non-Parametric Functions:  alighed_rank_transform_anova, art_contrast_posthoc_unified

Alternative Three-way Association Regression Functions:  stratified_logistic_regression_with_posthoc

ANOVA Functions:  cust_one_way_ANOVA, cust_two_way_ANOVA

Non-Parametric ANOVA Functions:  kruskal_wallis_with_dunn

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   RowNumber           10000 non-null  int64  
 1   CustomerId          10000 non-null  int64  
 2   Surname             10000 non-null  object 
 3   CreditScore         10000 non-null  int64  
 4   Geography           10000 non-null  object 
 5   Gender              10000 non-null  object 
 6   Age                 10000 non-null  int64  
 7   Tenure              10000 non-null  int64  
 8   Balance             10000 non-null  float64
 9   NumOfProducts       10000 non-null  int64  
 10  HasCrCard           10000 non-null  int64  
 11  IsActiveMember      10000 non-null  int64  
 12  EstimatedSalary     10000 non-null  float64
 13  Exited              10000 non-null  int64  
 14  Complain            10000 non-null  int64  
 15  Satisfaction Score  10000 non-null  int64  
 16  Card Type           10000 non-null  object 
 17  Point Earned        10000 non-null  int64  
dtypes: float64(2), int64(12), object(4)
memory usage: 1.4+ MB

 Table of missing data per column:

RowNumber             0
CustomerId            0
Surname               0
CreditScore           0
Geography             0
Gender                0
Age                   0
Tenure                0
Balance               0
NumOfProducts         0
HasCrCard             0
IsActiveMember        0
EstimatedSalary       0
Exited                0
Complain              0
Satisfaction Score    0
Card Type             0
Point Earned          0
dtype: int64

------------------------------
 
Count of duplicate Rows:

0
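
The two checks above (missing values per column and duplicate rows) map onto two pandas calls. A minimal sketch on a toy frame follows; the sample values are made up, and `df` is an assumed name for the loaded data.

```python
import pandas as pd

# Toy stand-in for the bank churn data (values are illustrative only).
df = pd.DataFrame({
    "CreditScore": [593, 821, 443, 443],
    "Geography": ["France", "France", "Spain", "Spain"],
    "Balance": [171740.69, 0.0, 70438.01, 70438.01],
})

# Missing values per column, as in the "Table of missing data per column".
missing_per_column = df.isna().sum()

# Number of fully duplicated rows (the first occurrence is not counted).
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print("Count of duplicate rows:", duplicate_rows)
```

In the toy frame the last two rows are identical, so one duplicate is reported; the real data reports none.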

 Numerical Columns:
  ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'Point Earned'] 

 Categorical Columns:
  ['Geography', 'Gender', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Exited', 'Complain', 'Satisfaction Score', 'Card Type'] 

 
Updated Feature Types:
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   CreditScore         10000 non-null  int64   
 1   Geography           10000 non-null  category
 2   Gender              10000 non-null  category
 3   Age                 10000 non-null  int64   
 4   Tenure              10000 non-null  int64   
 5   Balance             10000 non-null  float64 
 6   NumOfProducts       10000 non-null  category
 7   HasCrCard           10000 non-null  category
 8   IsActiveMember      10000 non-null  category
 9   EstimatedSalary     10000 non-null  float64 
 10  Exited              10000 non-null  category
 11  Complain            10000 non-null  category
 12  Satisfaction Score  10000 non-null  category
 13  Card Type           10000 non-null  category
 14  Point Earned        10000 non-null  int64   
dtypes: category(9), float64(2), int64(4)
memory usage: 558.1 KB
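
The drop from 18 to 15 columns and the new `category` dtypes are consistent with removing the three identifier columns and casting the categorical features. A minimal sketch of that step, assuming a DataFrame named `df`; the toy values and the shortened column lists are illustrative only.

```python
import pandas as pd

# Toy frame with a few of the original columns (illustrative values).
df = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15774847, 15715877, 15732674],
    "Surname": ["Knight", "Lo", "Fennell"],
    "Geography": ["France", "France", "Spain"],
    "NumOfProducts": [1, 2, 2],
    "Exited": [0, 0, 1],
    "Age": [50, 28, 36],
})

identifier_cols = ["RowNumber", "CustomerId", "Surname"]
categorical_cols = ["Geography", "NumOfProducts", "Exited"]

# Drop the identifiers, then cast the categorical features in one step.
df_typed = df.drop(columns=identifier_cols).astype(
    {col: "category" for col in categorical_cols}
)

df_typed.info()
```

Storing low-cardinality columns as `category` usually needs less memory than `object` or `int64`, which would fit the smaller footprint reported above.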

 Summary Statistics:

 Case when zero balances have been removed from Balance:

 

Case when original Balance is log1p transformed:

Geography:
+---------+---------+--------------------------+
| Value   |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
| France  |    5014 |                     50.1 |
| Germany |    2509 |                     25.1 |
| Spain   |    2477 |                     24.8 |
+---------+---------+--------------------------+

Gender:
+---------+---------+--------------------------+
| Value   |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
| Male    |    5457 |                     54.6 |
| Female  |    4543 |                     45.4 |
+---------+---------+--------------------------+

NumOfProducts:
+---------+---------+--------------------------+
|   Value |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
|       1 |    5084 |                     50.8 |
|       2 |    4590 |                     45.9 |
|       3 |     266 |                      2.7 |
|       4 |      60 |                      0.6 |
+---------+---------+--------------------------+

HasCrCard:
+---------+---------+--------------------------+
|   Value |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
|       1 |    7055 |                     70.6 |
|       0 |    2945 |                     29.4 |
+---------+---------+--------------------------+

IsActiveMember:
+---------+---------+--------------------------+
|   Value |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
|       1 |    5151 |                     51.5 |
|       0 |    4849 |                     48.5 |
+---------+---------+--------------------------+

Exited:
+---------+---------+--------------------------+
|   Value |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
|       0 |    7962 |                     79.6 |
|       1 |    2038 |                     20.4 |
+---------+---------+--------------------------+

Complain:
+---------+---------+--------------------------+
|   Value |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
|       0 |    7956 |                     79.6 |
|       1 |    2044 |                     20.4 |
+---------+---------+--------------------------+

Satisfaction Score:
+---------+---------+--------------------------+
|   Value |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
|       3 |    2042 |                     20.4 |
|       2 |    2014 |                     20.1 |
|       4 |    2008 |                     20.1 |
|       5 |    2004 |                     20   |
|       1 |    1932 |                     19.3 |
+---------+---------+--------------------------+

Card Type:
+----------+---------+--------------------------+
| Value    |   Count |   Relative Frequency (%) |
|----------+---------+--------------------------|
| DIAMOND  |    2507 |                     25.1 |
| GOLD     |    2502 |                     25   |
| SILVER   |    2496 |                     25   |
| PLATINUM |    2495 |                     25   |
+----------+---------+--------------------------+

HasZeroBalance:
+---------+---------+--------------------------+
|   Value |   Count |   Relative Frequency (%) |
|---------+---------+--------------------------|
|       0 |    6383 |                     63.8 |
|       1 |    3617 |                     36.2 |
+---------+---------+--------------------------+
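
Frequency tables like the ones above can be built from `value_counts`. The helper below is a hypothetical sketch (the name `frequency_table` does not come from the project); it is fed a series rebuilt from the printed Geography counts.

```python
import pandas as pd


def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Return counts and relative frequencies (%) per level, most common first."""
    counts = series.value_counts()
    return pd.DataFrame({
        "Value": counts.index,
        "Count": counts.to_numpy(),
        "Relative Frequency (%)": (counts.to_numpy() / len(series) * 100).round(1),
    })


# Series rebuilt from the Geography counts printed above.
geography = pd.Series(["France"] * 5014 + ["Germany"] * 2509 + ["Spain"] * 2477)
table = frequency_table(geography)
print(table.to_string(index=False))
```

Dividing by `len(series)` assumes the column has no missing values, which holds for this data set according to the missing-data table.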


Count of customers who complained but didn't churn:  10


Count of customers who churned but didn't complain:  4
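
These two counts, together with the Exited and Complain frequency tables, pin down the full 2x2 cross-tabulation. The sketch below rebuilds it with plain arithmetic and scores the naive rule "predict churn exactly when the customer complained". The rule is an illustration only, not one of the fitted models.

```python
# Marginal counts printed in the frequency tables above.
n_total = 10_000
n_exited = 2038        # Exited == 1
n_complained = 2044    # Complain == 1

# Off-diagonal counts printed just above.
complained_not_churned = 10
churned_not_complained = 4

# The remaining cells follow from the margins.
complained_and_churned = n_complained - complained_not_churned   # 2034
neither = n_total - n_complained - churned_not_complained        # 7952

# Consistency check: both routes to the churned total must agree.
assert complained_and_churned + churned_not_complained == n_exited

# Metrics of the rule "predict churn if and only if the customer complained".
accuracy = (complained_and_churned + neither) / n_total
precision = complained_and_churned / n_complained
recall = complained_and_churned / n_exited

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The rule reaches an accuracy of 0.9986 and a precision of 0.9951 on the full data, the same values the model comparison table further down reports for every tuned model. This suggests that Complain alone carries almost all of the predictive signal.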

Chi-Square Statistic: 1501.505, p-value: 0.000

Chi-Square Statistic: 300.626, p-value: 0.000

Chi-Square Statistic: 112.397, p-value: 0.000

Chi-Square Statistic: 243.695, p-value: 0.000

Chi-Square Statistic: 149.484, p-value: 0.000
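
Each line above presumably reports a chi-square test of independence between one categorical feature and churn; a p-value printed as 0.000 rounds to zero at three decimals rather than being exactly zero. A minimal sketch with SciPy's `chi2_contingency`, using a made-up contingency table rather than the project's data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are the levels of a categorical
# feature, columns are Exited (0, 1). The counts are made up.
observed = np.array([
    [4200, 800],
    [1700, 800],
    [2100, 400],
])

# correction=False turns off the Yates continuity correction, which SciPy
# would otherwise apply only to tables with one degree of freedom.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(f"Chi-Square Statistic: {chi2:.3f}, p-value: {p_value:.3f}")
```

The statistic is the sum of `(observed - expected) ** 2 / expected` over all cells, where `expected` holds the counts implied by independent margins.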

Mann-Whitney U Statistic: 4347741.000, p-value: 0.000

Mann-Whitney U Statistic including zero balance: 6852646.500, p-value: 0.000

Mann-Whitney U Statistic on non-zero balance: 3650896.500, p-value: 0.234
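
The contrast between the last two results (significant with zero balances included, not significant without them) is what one would expect if the groups differ mainly in how often the balance is zero. The sketch below illustrates that pattern on synthetic data with SciPy's `mannwhitneyu`; the distributions, group sizes, and zero shares are assumptions, not estimates from the bank data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)


def simulate(n, zero_share):
    """Draw balances where a random share of customers holds exactly zero."""
    balances = rng.normal(120_000, 30_000, size=n).clip(min=1)
    balances[rng.random(n) < zero_share] = 0.0
    return balances


# Non-zero balances share one distribution; only the share of zeros differs.
stayed = simulate(4000, zero_share=0.40)
churned = simulate(1000, zero_share=0.25)

u_all, p_all = mannwhitneyu(stayed, churned, alternative="two-sided")
u_nonzero, p_nonzero = mannwhitneyu(
    stayed[stayed > 0], churned[churned > 0], alternative="two-sided"
)

print(f"Including zero balance: U={u_all:.1f}, p-value={p_all:.3f}")
print(f"Non-zero balance only:  U={u_nonzero:.1f}, p-value={p_nonzero:.3f}")
```

Running the test on both versions of the column separates the effect of holding any balance at all from the effect of the balance amount.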

CMH equivalent test (Conditional Independence) Statistic: 8.0775, p-value: 0.004

Breslow-Day equivalent test (Homogeneity) Statistic: 240.5787, p-value: 0.000

Mann-Whitney U Statistic between Age and IsActiveMember: 11914173.000, p-value: 0.000
-------------------------------------------------- 

Aligned Rank Transformation Omnibus Test for Effects of Exited and IsActiveMember on Age:

                                          sum_sq   df      F  PR(>F)
Main effect: Exited                4218716618.88 1.00 568.87    0.00
Main effect: IsActiveMember          13257084.85 1.00   1.59    0.21
Interaction: Exited:IsActiveMember  193203740.46 1.00  23.29    0.00
-------------------------------------------------- 

Aligned Rank Transformation Contrast Post-Hoc Test for Interaction (Exit:IsActiveMember):

     0:0  0:1  1:0  1:1
0:0 1.00 0.00 0.00 0.00
0:1 0.00 1.00 0.00 0.00
1:0 0.00 0.00 1.00 0.65
1:1 0.00 0.00 0.65 1.00
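
For readers unfamiliar with the aligned rank transform, the core idea for a two-factor interaction is to strip the main effects from the response and then rank what is left. The sketch below shows only that align-and-rank step on synthetic data; the function name and the toy values are hypothetical, and this is not the project's `alighed_rank_transform_anova` implementation.

```python
import numpy as np
import pandas as pd


def align_and_rank_interaction(df, response, factor_a, factor_b):
    """Align `response` for the A:B interaction, then rank the aligned values."""
    y = df[response].astype(float)
    grand_mean = y.mean()
    mean_a = y.groupby(df[factor_a]).transform("mean")
    mean_b = y.groupby(df[factor_b]).transform("mean")
    # Residual (y - cell mean) plus the estimated interaction effect
    # (cell mean - mean_a - mean_b + grand mean) simplifies to this:
    aligned = y - mean_a - mean_b + grand_mean
    # Ties receive the average of the ranks they span.
    return aligned.rank(method="average")


rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Exited": np.repeat([0, 1], 40),
    "IsActiveMember": np.tile([0, 1], 40),
})
toy["Age"] = rng.normal(40, 8, size=80).round()

ranks = align_and_rank_interaction(toy, "Age", "Exited", "IsActiveMember")
print(ranks.head())
```

A factorial ANOVA would then be fitted on these ranks, and only the interaction row would be interpreted; each main effect needs its own alignment.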

Conditional independence test between Age and IsActiveMember on Exited, statistic: 911.889, p-value: 0.000


 --------------------------------------------------

Homogeneity test between Age and IsActiveMember on Exited, statistic: 340.270, p-value: 0.000

Kruskal-Wallis H-Statistic between Age and Number of Products: 189.777, p-value: 0.000

Dunn's test pairwise interaction comparisons for Age grouped by Number of Products:
     1    2    3    4
1 1.00 0.00 0.00 0.00
2 0.00 1.00 0.00 0.00
3 0.00 0.00 1.00 0.25
4 0.00 0.00 0.25 1.00
-------------------------------------------------- 

Aligned Rank Transformation Omnibus Test for Effects of Exited and Number of Products on Age:

                                         sum_sq   df      F  PR(>F)
Main effect: Exited               5736627826.87 1.00 768.74    0.00
Main effect: NumOfProducts        1048307984.03 3.00  43.00    0.00
Interaction: Exited:NumOfProducts  175801943.35 3.00   7.12    0.00
-------------------------------------------------- 

Aligned Rank Transformation Contrast Post-Hoc Test for Interaction (Exit:NumOfProducts):

     0:1  0:2  0:3  1:1  1:2  1:3  1:4
0:1 1.00 0.78 1.00 0.00 0.00 0.00 0.00
0:2 0.78 1.00 1.00 0.00 0.00 0.00 0.00
0:3 1.00 1.00 1.00 0.00 0.00 0.00 0.00
1:1 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1:2 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1:3 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1:4 0.00 0.00 0.00 1.00 1.00 1.00 1.00

Chi-Square Statistic: 3.011, p-value: 0.556

-----------Standard correlation plot of transformed data-------------
X_Shape: (8000, 17)
y_Shape: (8000, 1)

-----------Smotenc correlation plot of transformed data-------------
X_Shape: (12740, 17)
y_Shape: (12740, 1)
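
The two row counts can be checked against figures reported elsewhere in this output. The arithmetic below is a sketch that assumes the oversampler grows the minority class until both classes are the same size; the class counts come from the Exited frequency table and the test-set support in the classification report.

```python
# Class counts in the full data set (Exited frequency table).
n_stayed, n_churned = 7962, 2038

# Test-set support per class (classification report).
test_stayed, test_churned = 1592, 408

train_stayed = n_stayed - test_stayed        # 6370
train_churned = n_churned - test_churned     # 1630
train_rows = train_stayed + train_churned    # 8000

# Assumption: the minority class is oversampled up to the majority size.
balanced_rows = 2 * max(train_stayed, train_churned)   # 12740
synthetic_rows = balanced_rows - train_rows            # 4740

print(train_rows, balanced_rows, synthetic_rows)
```

Under that assumption, 4,740 of the 12,740 training rows are synthetic churners.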

Model: LogisticRegression
Best Score: 0.99862500
Best Parameters: {'classifier__C': 0.0004, 'classifier__max_iter': 50, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}
------------------------------
Model: SVC
Best Score: 0.99862500
Best Parameters: {'classifier__C': 0.1, 'classifier__max_iter': 100}
------------------------------
Model: RandomForest
Best Score: 0.99862500
Best Parameters: {'classifier__max_depth': None, 'classifier__max_features': 'sqrt', 'classifier__min_samples_split': 4, 'classifier__n_estimators': 100}
------------------------------
Model: AdaBoost
Best Score: 0.99862500
Best Parameters: {'classifier__learning_rate': 0.03, 'classifier__n_estimators': 50}
------------------------------
Model: XGBoost
Best Score: 0.99862500
Best Parameters: {'classifier__alpha': 50, 'classifier__colsample_bytree': 0.3, 'classifier__lambda': 50, 'classifier__learning_rate': 0.03, 'classifier__max_depth': 2, 'classifier__min_child_weight': 5, 'classifier__n_estimators': 50, 'classifier__subsample': 1.0}
------------------------------

Best model is LogisticRegression with Test Accuracy: 0.9985
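
The parameter names in the search results (`classifier__C`, `classifier__max_iter`, ...) follow scikit-learn's `<step>__<parameter>` convention, which implies a `Pipeline` whose final step is named `classifier`. A minimal sketch of such a search on synthetic data follows; the scaler step, the grid values, the fold count, and the scoring metric are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stands in for the preprocessed churn features).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# The step name "classifier" is what makes "classifier__C" a valid grid key.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="liblinear")),
])

param_grid = {
    "classifier__C": [0.01, 0.1, 1.0],
    "classifier__max_iter": [50, 100],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best Score:", round(search.best_score_, 4))
print("Best Parameters:", search.best_params_)
```

Keeping preprocessing inside the pipeline means each cross-validation fold fits the scaler on its own training portion only.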
Average CV metrics summary:

The final estimator in the pipeline does not have 'feature_importances_'.

Feature Coefficients:

Average CV metrics summary:

LogisticRegression Classification Report on Test Set:
+--------------+-------------+----------+------------+-----------+
|              |   precision |   recall |   f1-score |   support |
|--------------+-------------+----------+------------+-----------|
| 0            |       0.999 |    0.999 |      0.999 |  1592.000 |
| 1            |       0.995 |    0.998 |      0.996 |   408.000 |
| accuracy     |       0.999 |    0.999 |      0.999 |     0.999 |
| macro avg    |       0.997 |    0.998 |      0.998 |  2000.000 |
| weighted avg |       0.999 |    0.999 |      0.999 |  2000.000 |
+--------------+-------------+----------+------------+-----------+

['bank_churn_pipeline.pkl']

Category	Column Name	Description
Identifiers	RowNumber	A sequential number for each record.
	CustomerId	A unique identifier for each customer.
	Surname	The customer's last name.
Demographic Information	Geography	The country/region of the customer (e.g., France, Germany, Spain). Location may influence churn behavior.
	Gender	The customer’s gender (Male/Female). May show patterns in churn behavior.
	Age	The customer’s age. May show patterns in churn behavior; in this data, age correlates positively with complaints (0.32).
Financial Information	CreditScore	The customer’s credit score (300–850). Higher scores indicate lower likelihood of churn.
	Balance	The amount in the customer’s account. Higher balances often correlate with lower churn rates.
	EstimatedSalary	The estimated salary of the customer. Higher salaries may indicate financial stability and lower churn.
Customer Engagement	Tenure	The number of years the customer has been with the bank. Longer tenure suggests higher loyalty.
	NumOfProducts	The number of bank products (e.g., accounts, loans) the customer uses. More products may reduce churn likelihood.
	HasCrCard	Whether the customer has a credit card (0 = No, 1 = Yes). Credit card holders are less likely to churn.
	IsActiveMember	Whether the customer is an active user (0 = No, 1 = Yes). Active customers are less likely to leave.
	Card Type	The type of credit card (DIAMOND, GOLD, PLATINUM, SILVER). May reflect customer preferences or financial status.
	Point Earned	Points earned from credit card usage. Higher points may indicate satisfaction and lower churn risk.
Customer Feedback	Complain	Whether the customer has filed a complaint (0 = No, 1 = Yes). Complaints are a strong indicator of potential churn.
	Satisfaction Score	The score (1–5) given by the customer for complaint resolution. May affect churn likelihood.
Target Variable	Exited	Whether the customer left the bank (0 = Stayed, 1 = Left). This is the target variable for prediction.

	CreditScore	Age	Tenure	Balance	EstimatedSalary	Point Earned
count	10000.00	10000.00	10000.00	10000.00	10000.00	10000.00
mean	650.53	38.92	5.01	76485.89	100090.24	606.52
std	96.65	10.49	2.89	62397.41	57510.49	225.92
min	350.00	18.00	0.00	0.00	11.58	119.00
25%	584.00	32.00	3.00	0.00	51002.11	410.00
50%	652.00	37.00	5.00	97198.54	100193.91	605.00
75%	718.00	44.00	7.00	127644.24	149388.25	801.00
max	850.00	92.00	10.00	250898.09	199992.48	1000.00

	Var1	Var2	correlation
0	Geography_Germany	HasZeroBalance_1	-0.44
1	Geography_Germany	HasZeroBalance_0	0.44
2	HasZeroBalance_1	NumOfProducts_2	0.39
3	HasZeroBalance_0	NumOfProducts_2	-0.39
4	HasZeroBalance_1	NumOfProducts_1	-0.39
5	HasZeroBalance_0	NumOfProducts_1	0.39
6	Geography_Germany	Balance	0.37
7	Balance	NumOfProducts_2	-0.34
8	Balance	NumOfProducts_1	0.33
9	Age	Complain_1	0.32
10	Age	Complain_0	-0.32
11	NumOfProducts_2	Complain_0	0.29
12	NumOfProducts_2	Complain_1	-0.29
13	NumOfProducts_3	Complain_1	0.26
14	NumOfProducts_3	Complain_0	-0.26
15	Geography_France	HasZeroBalance_1	0.25
16	Geography_France	HasZeroBalance_0	-0.25
17	Geography_France	Balance	-0.21
18	NumOfProducts_1	Complain_0	-0.18
19	NumOfProducts_1	Complain_1	0.18
20	Geography_Germany	Complain_1	0.18
21	Geography_Germany	Complain_0	-0.18
22	IsActiveMember_1	Complain_0	0.15
23	IsActiveMember_0	Complain_1	0.15
24	IsActiveMember_1	Complain_1	-0.15
25	IsActiveMember_0	Complain_0	-0.15
26	NumOfProducts_4	Complain_0	-0.15
27	NumOfProducts_4	Complain_1	0.15
28	Geography_Spain	HasZeroBalance_1	0.15
29	Geography_Spain	HasZeroBalance_0	-0.15
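
A table of ranked feature pairs like the one above can be derived from a correlation matrix of one-hot encoded data. A minimal sketch on toy values follows; the column subset and the numbers are illustrative, and the project's own `corr_plot_func` may work differently.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values, not the bank data set).
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "Germany", "France", "Spain"],
    "Balance": [0.0, 120000.0, 0.0, 98000.0, 50000.0, 0.0],
    "Age": [30, 45, 38, 52, 41, 29],
})

# One-hot encode the categorical column so every level gets its own column.
encoded = pd.get_dummies(toy, columns=["Geography"], dtype=float)
corr = encoded.corr()

# Keep the strict upper triangle: each pair once, no self-correlations.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = (
    upper.stack()
    .dropna()
    .rename_axis(["Var1", "Var2"])
    .reset_index(name="correlation")
)

# Strongest relationships first, regardless of sign.
pairs = pairs.sort_values(
    "correlation", key=lambda s: s.abs(), ascending=False, ignore_index=True
)
print(pairs.round(2))
```

Encoding both levels of a binary feature produces mirrored pairs with opposite signs, as with HasZeroBalance_0 and HasZeroBalance_1 in the table above.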

	Stratum	comparison	p_value_raw	corrected_p_value	significant
0	1	1 vs 0	0.00	0.00	True
1	3	1 vs 0	0.00	0.00	True
2	2	1 vs 0	0.00	0.00	True
3	4	1 vs 0 (Insufficient Data)	1.00	1.00	False

	model	Accuracy	Precision	Recall	F1	F1_std
0	LogisticRegression	0.9986	0.9951	0.9982	0.9966	0.0018
1	SVC	0.9986	0.9951	0.9982	0.9966	0.0018
2	RandomForest	0.9986	0.9951	0.9982	0.9966	0.0018
3	AdaBoost	0.9986	0.9951	0.9982	0.9966	0.0018
4	XGBoost	0.9986	0.9951	0.9982	0.9966	0.0018

	RowNumber	CustomerId	Surname	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Satisfaction Score	Card Type	Point Earned
8086	8087	15774847	Knight	593	France	Male	50	6	171740.69	1	0	0	20893.61	4	SILVER	602
5633	5634	15715877	Lo	821	France	Male	28	2	0.00	2	1	0	46072.52	2	DIAMOND	544
403	404	15732674	Fennell	443	Spain	Male	36	6	70438.01	2	0	1	56937.43	4	GOLD	858
8906	8907	15797065	Goloubev	613	Spain	Female	32	0	0.00	2	0	1	126675.62	5	DIAMOND	963
2060	2061	15747980	Cattaneo	737	Spain	Male	38	6	146282.79	2	1	0	198516.20	5	PLATINUM	703

	feature	importance
5	Complain_1	0.4330
0	Age	0.0000
10	NumOfProd2_X_ZeroBal	0.0000
16	NumOfProd4_X_Age	0.0000
15	NumOfProd3_X_Age	0.0000
14	NumOfProd2_X_Age	0.0000
13	IsActive_X_Age	0.0000
12	NumOfProd4_X_ZeroBal	0.0000
11	NumOfProd3_X_ZeroBal	0.0000
9	NumOfProducts_4	0.0000
1	Geography_Germany	0.0000
8	NumOfProducts_3	0.0000
7	NumOfProducts_2	0.0000
6	HasZeroBalance_1	0.0000
4	IsActiveMember_1	0.0000
3	Gender_Male	0.0000
2	Geography_Spain	0.0000
17	intercept	0.0000

	accuracy_mean	accuracy_std	precision_mean	precision_std	recall_mean	recall_std	f1_mean	f1_std
Complain	0.3222	0.0076	0.7879	0.0187	0.7898	0.0187	0.7888	0.0187
CreditScore	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Geography	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Gender	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Age	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Tenure	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Balance	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
NumOfProducts	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
HasCrCard	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
IsActiveMember	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
EstimatedSalary	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Satisfaction Score	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Card Type	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
Point Earned	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
HasZeroBalance	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000

Project Outline 📑

Bank Customer Churn Prediction Project

Introduction

Dataset Description

Project Objective

Project Workflow

Load required libraries, functions, presets, and data

Custom Functions

1. High Level Data Exploration

Displaying a Random Sample of the Data

Data Overview

2. Exploratory Data Analysis

2.1 Univariate Analysis

Numerical Features Analysis

Let's visualize this information!

Univariate Analysis: Numerical Features

Key Insights and Preprocessing Recommendations

Balance Revisited

Categorical Features Analysis

Univariate Analysis Insights: Categorical Features

Insights: A Narrative of Customer Profiles

Preprocessing Plan

2.2 Bivariate Analysis: Linking Features to Churn

Correlation Analysis

Effect of Complaints on Churn

Product Engagement Effect on Churn

Geographical Effects on Churn

Gender Effects on Churn

Active Members Effects on Churn

Zero Balance Members Effects on Churn

Effects of Age on Churn

Effects of Overall Balance on Churn

Linking Features to Churn Summary

2.3 Multivariate Analysis

Product Engagement Effects on Zero Balance and Churn

Customer Active Status on Age and Churn

Product Engagement and Age on Churn

Data Anomalies

Satisfaction Scores Well Distributed Among Complaining Customers

Credit Card Points Earned for Non-Credit Card Owners

Multivariate Analysis Conclusion

3 Data Preprocessing and Cleaning

4 Model Implementation