Bank Customer Churn Prediction Project¶
Introduction¶
This project focuses on predicting bank customer churn—whether a customer will leave or stay—using various personal and behavioral attributes. Customer retention is vital for banks, as keeping current customers is significantly more cost-effective than gaining new ones. By identifying the key factors driving churn, banks can develop targeted loyalty and retention strategies to minimize customer turnover. We will build a predictive model using a dataset from Kaggle containing customer information and behavioral data.
Dataset Description¶
The dataset contains 18 columns, each representing a feature of a bank customer. These features include demographic details, financial information, customer engagement, and feedback metrics. Below is a detailed description of the columns, grouped by their relevance for better understanding:
| Category | Column Name | Description |
|---|---|---|
| Identifiers | RowNumber | A sequential number for each record. |
| | CustomerId | A unique identifier for each customer. |
| | Surname | The customer's last name. |
| Demographic Information | Geography | The country/region of the customer (France, Germany, or Spain). Location may influence churn behavior. |
| | Gender | The customer’s gender (Male/Female). May show patterns in churn behavior. |
| | Age | The customer’s age. Age may be related to loyalty and churn behavior. |
| Financial Information | CreditScore | The customer’s credit score (300–850). Higher scores may indicate a lower likelihood of churn. |
| | Balance | The amount in the customer’s account. Balance level may be related to churn. |
| | EstimatedSalary | The estimated salary of the customer. Higher salaries may indicate financial stability and lower churn. |
| Customer Engagement | Tenure | The number of years the customer has been with the bank. Longer tenure may suggest higher loyalty. |
| | NumOfProducts | The number of bank products (e.g., accounts, loans) the customer uses. Product count may affect churn likelihood. |
| | HasCrCard | Whether the customer has a credit card (0 = No, 1 = Yes). Credit card holders may be less likely to churn. |
| | IsActiveMember | Whether the customer is an active user (0 = No, 1 = Yes). Active customers may be less likely to leave. |
| | Card Type | The customer's card tier (DIAMOND, GOLD, SILVER, or PLATINUM). May reflect customer preferences or financial status. |
| | Point Earned | Points earned from credit card usage. Higher points may indicate satisfaction and lower churn risk. |
| Customer Feedback | Complain | Whether the customer has filed a complaint (0 = No, 1 = Yes). Complaints may be a strong indicator of potential churn. |
| | Satisfaction Score | The score (1–5) given by the customer for complaint resolution. May affect churn likelihood. |
| Target Variable | Exited | Whether the customer left the bank (0 = Stayed, 1 = Left). This is the target variable for prediction. |
Project Objective¶
This project's core purpose is to develop an accurate machine learning model to predict customer churn (leaving the bank). By isolating the main churn drivers, such as high complaints, bank balance, or customer age, the bank can implement proactive retention measures like better complaint handling or personalized service offerings.
Project Workflow¶
This project employs a standard data science pipeline to develop a robust predictive model for customer churn (the 'Exited' variable).
- Data Exploration (EDA): Systematically analyze the dataset to understand the underlying structure, relationships, and feature characteristics.
- Univariate Analysis: Profile individual features (e.g., age distribution, balance characteristics).
- Bivariate Analysis: Examine correlations between feature pairs and the target variable (e.g., age vs. churn rate).
- Multivariate Analysis: Investigate complex feature interactions to reveal multi-factor patterns.
- Data Cleaning and Preprocessing: Handle missing values, encode categorical variables (e.g., Geography, Card Type), remove irrelevant columns (e.g., RowNumber, CustomerId, Surname), apply synthetic data sampling techniques as necessary to balance the dataset, and prepare the data for modeling.
- Predictive Modeling: Construct and train a diverse set of classification algorithms to predict customer exit. Models: Logistic Regression, Support Vector Classifier (SVC), Random Forest, AdaBoost, and XGBoost.
- Evaluation and Recommendations: Rigorously evaluate model efficacy using key classification metrics and identify the most important features driving churn to deliver actionable, data-backed recommendations for the bank's retention strategy.
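The workflow above can be sketched end to end on synthetic data. This is a minimal illustration, not the project's actual pipeline: the data are randomly generated, only two of the listed models are included (SVC, AdaBoost, XGBoost, and resampling are omitted), and the column subset and ROC AUC metric are assumptions chosen for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (hypothetical values, not the Kaggle file).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Balance": rng.uniform(0, 200_000, size=n),
    "Geography": rng.choice(["France", "Germany", "Spain"], size=n),
})
# Synthetic target: exit probability rises with age (an assumption for the demo).
y = (rng.random(n) < (X["Age"] - 18) / 80).astype(int)

# Scale numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Balance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Geography"]),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratify so both splits keep the same share of churners.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

scores = {}
for name, model in models.items():
    pipeline = Pipeline([("preprocess", clone(preprocess)), ("model", model)])
    pipeline.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])

for name, auc in scores.items():
    print(f"{name}: ROC AUC = {auc:.3f}")
```

Wrapping preprocessing and model in one `Pipeline` keeps the scaler and encoder fitted on the training split only, which avoids leaking test information into the evaluation.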
Load required libraries, functions, presets, and data¶
Data source import complete.
Custom Functions¶
Graphing Functions: plot_distributions, plot_categorical_features, three_way_cat_cont_interactions, corr_plot_func
Three-way Association Regression Functions: log_lin_3_way_cat_association, pairwise_loglin_homogenous_posthoc, pairwise_loglin_simple_heterogenous_posthoc, stratified_categorical_logistic_regression
Three-way Association Non-Parametric Functions: alighed_rank_transform_anova, art_contrast_posthoc_unified
Alternative Three-way Association Regression Functions: stratified_logistic_regression_with_posthoc
ANOVA Functions: cust_one_way_ANOVA, cust_two_way_ANOVA
Non-Parametric ANOVA Functions: kruskal_wallis_with_dunn
Displaying a Random Sample of the Data¶
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Complain | Satisfaction Score | Card Type | Point Earned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8086 | 8087 | 15774847 | Knight | 593 | France | Male | 50 | 6 | 171740.69 | 1 | 0 | 0 | 20893.61 | 0 | 0 | 4 | SILVER | 602 |
| 5633 | 5634 | 15715877 | Lo | 821 | France | Male | 28 | 2 | 0.00 | 2 | 1 | 0 | 46072.52 | 0 | 0 | 2 | DIAMOND | 544 |
| 403 | 404 | 15732674 | Fennell | 443 | Spain | Male | 36 | 6 | 70438.01 | 2 | 0 | 1 | 56937.43 | 0 | 0 | 4 | GOLD | 858 |
| 8906 | 8907 | 15797065 | Goloubev | 613 | Spain | Female | 32 | 0 | 0.00 | 2 | 0 | 1 | 126675.62 | 0 | 0 | 5 | DIAMOND | 963 |
| 2060 | 2061 | 15747980 | Cattaneo | 737 | Spain | Male | 38 | 6 | 146282.79 | 2 | 1 | 0 | 198516.20 | 0 | 0 | 5 | PLATINUM | 703 |
Data Overview¶
Upon initial inspection, the columns RowNumber, CustomerId, and Surname add no value to predicting Exited and will therefore be removed. Most of our features are stored as integers or floats. Variables such as Complain and Exited should be treated as categorical; we will update their types before analysis.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   RowNumber           10000 non-null  int64
 1   CustomerId          10000 non-null  int64
 2   Surname             10000 non-null  object
 3   CreditScore         10000 non-null  int64
 4   Geography           10000 non-null  object
 5   Gender              10000 non-null  object
 6   Age                 10000 non-null  int64
 7   Tenure              10000 non-null  int64
 8   Balance             10000 non-null  float64
 9   NumOfProducts       10000 non-null  int64
 10  HasCrCard           10000 non-null  int64
 11  IsActiveMember      10000 non-null  int64
 12  EstimatedSalary     10000 non-null  float64
 13  Exited              10000 non-null  int64
 14  Complain            10000 non-null  int64
 15  Satisfaction Score  10000 non-null  int64
 16  Card Type           10000 non-null  object
 17  Point Earned        10000 non-null  int64
dtypes: float64(2), int64(12), object(4)
memory usage: 1.4+ MB
Fortunately, there are no missing data points or duplicate rows.
Table of missing data per column:
RowNumber             0
CustomerId            0
Surname               0
CreditScore           0
Geography             0
Gender                0
Age                   0
Tenure                0
Balance               0
NumOfProducts         0
HasCrCard             0
IsActiveMember        0
EstimatedSalary       0
Exited                0
Complain              0
Satisfaction Score    0
Card Type             0
Point Earned          0
dtype: int64
------------------------------
Count of duplicate Rows:
0
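The missing-value and duplicate checks reported above can be reproduced with two pandas calls. The sketch below uses a small hypothetical frame in place of the real dataset; the notebook's own code is not shown in this export, so the variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data.
df = pd.DataFrame({
    "CreditScore": [593, 821, 443, 613],
    "Geography": ["France", "France", "Spain", "Spain"],
    "Balance": [171740.69, 0.00, 70438.01, 0.00],
})

missing_per_column = df.isna().sum()         # number of missing values in each column
duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row

print("Table of missing data per column:")
print(missing_per_column)
print("-" * 30)
print("Count of duplicate Rows:")
print(duplicate_rows)
```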
2. Exploratory Data Analysis
We'll start with Exploratory Data Analysis (EDA) to get a clear picture of the dataset, looking for patterns and relationships that influence customer churn. This analysis is broken down into three focused stages:
- Univariate Analysis: Examining each feature's individual characteristics (like age or credit score) using statistics and visualizations to spot outliers or skewness.
- Bivariate Analysis: Investigating how pairs of features relate to the target variable, Exited (churn), to identify strong predictors using plots and correlation methods.
- Multivariate Analysis: Exploring the complex interactions among multiple features to uncover deeper patterns that simpler analyses miss.
The insights gained from this structured EDA will highlight any necessary corrections, like handling imbalanced data or outliers, which we'll address in the Data Cleaning and Preprocessing stage before we build our predictive models.
To begin, we will split the features into numerical and categorical groups, since univariate analysis treats the two data types differently.
Numerical Columns: ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'Point Earned']
Categorical Columns: ['Geography', 'Gender', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Exited', 'Complain', 'Satisfaction Score', 'Card Type']
Updated Feature Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   CreditScore         10000 non-null  int64
 1   Geography           10000 non-null  category
 2   Gender              10000 non-null  category
 3   Age                 10000 non-null  int64
 4   Tenure              10000 non-null  int64
 5   Balance             10000 non-null  float64
 6   NumOfProducts       10000 non-null  category
 7   HasCrCard           10000 non-null  category
 8   IsActiveMember      10000 non-null  category
 9   EstimatedSalary     10000 non-null  float64
 10  Exited              10000 non-null  category
 11  Complain            10000 non-null  category
 12  Satisfaction Score  10000 non-null  category
 13  Card Type           10000 non-null  category
 14  Point Earned        10000 non-null  int64
dtypes: category(9), float64(2), int64(4)
memory usage: 558.1 KB
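One way to arrive at a type summary like the one above is to drop the identifier columns and cast the categorical variables to the pandas `category` dtype. The following is a minimal sketch on a hypothetical three-row frame with a subset of the columns; it is not the notebook's original code.

```python
import pandas as pd

# Hypothetical miniature version of the dataset (subset of columns).
df = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15774847, 15715877, 15732674],
    "Surname": ["Knight", "Lo", "Fennell"],
    "CreditScore": [593, 821, 443],
    "Geography": ["France", "France", "Spain"],
    "Balance": [171740.69, 0.00, 70438.01],
    "NumOfProducts": [1, 2, 2],
    "Exited": [0, 0, 0],
    "Complain": [0, 0, 0],
})

# Identifier columns carry no predictive signal, so remove them.
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])

# Cast coded variables to the category dtype.
categorical_cols = ["Geography", "NumOfProducts", "Exited", "Complain"]
df = df.astype({col: "category" for col in categorical_cols})

# Whatever is still numeric after the cast is treated as a numerical feature.
numerical_cols = df.select_dtypes(include="number").columns.tolist()

print("Numerical Columns:", numerical_cols)
print("Categorical Columns:", categorical_cols)
```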
Numerical Features Analysis¶
Summary Statistics:
| | CreditScore | Age | Tenure | Balance | EstimatedSalary | Point Earned |
|---|---|---|---|---|---|---|
| count | 10000.00 | 10000.00 | 10000.00 | 10000.00 | 10000.00 | 10000.00 |
| mean | 650.53 | 38.92 | 5.01 | 76485.89 | 100090.24 | 606.52 |
| std | 96.65 | 10.49 | 2.89 | 62397.41 | 57510.49 | 225.92 |
| min | 350.00 | 18.00 | 0.00 | 0.00 | 11.58 | 119.00 |
| 25% | 584.00 | 32.00 | 3.00 | 0.00 | 51002.11 | 410.00 |
| 50% | 652.00 | 37.00 | 5.00 | 97198.54 | 100193.91 | 605.00 |
| 75% | 718.00 | 44.00 | 7.00 | 127644.24 | 149388.25 | 801.00 |
| max | 850.00 | 92.00 | 10.00 | 250898.09 | 199992.48 | 1000.00 |
Let's visualize this information!¶
Univariate Analysis: Numerical Features¶
This analysis explores the distributions of the key numerical features—CreditScore, Age, Tenure, Balance, EstimatedSalary, and Point Earned—in the Bank Customer Churn dataset. The goal is to interpret their univariate distributions, identify non-normality or unusual patterns, and propose specific preprocessing strategies for effective model building. We will leverage provided histograms, box plots, and summary statistics, including Q-Q plot $R^2$ values, as evidence.
Key Insights and Preprocessing Recommendations¶
The numerical features offer crucial insights into the bank’s customer base, guiding our feature engineering and modeling plan.
- CreditScore (Mean: 650.53, Median: 652.0, Skewness: -0.07, Kurtosis: -0.43, R²: 0.9941)
The distribution is symmetric and close to normal (Skewness: -0.07). The high Q-Q plot $R^2$ (0.9941) indicates a very good fit to a normal distribution, with only minor deviations in the lower tail (outliers below 400).
- Insight: The majority of customers have healthy credit scores. The few low outliers (below 400) may represent higher-risk customers, which could be predictive of churn.
- Action: No transformation is required. The near-normal distribution is suitable for most linear and distance-based models.
- Age (Mean: 38.92, Median: 37.0, Skewness: 1.01, Kurtosis: 1.39, R²: 0.9441)
The distribution is right-skewed (Skewness: 1.01), with a concentration of customers in their 30s and 40s. The lower $R^2$ (0.9441) and the long upper tail indicate a departure from normality.
- Insight: The customer base is relatively young. The long tail of older customers (outliers above ≈60) suggests a small but distinct segment. How age relates to churn cannot be read from this distribution alone and is examined in the bivariate analysis.
- Action: Apply a log or Yeo-Johnson transformation to reduce the skewness and stabilize the variance, which helps models that are sensitive to skewed inputs.
- Tenure (Mean: 5.01, Median: 5.0, Skewness: 0.01, Kurtosis: -1.17, R²: 0.9489)
The distribution is nearly uniform across the range (0 to 10 years), with near-zero skewness (0.01). Because the Q-Q plot $R^2$ (0.9489) measures fit to a normal distribution, a flat, discrete variable like this one is expected to fall short of 1. No outliers were observed.
- Insight: Tenure is evenly spread, meaning the bank does not have a disproportionate number of very new or very long-standing customers.
- Action: Scaling (StandardScaler or MinMaxScaler) is the primary need, as transformation is unnecessary for a uniform, discrete variable.
- Balance (Mean: 76485.89, Median: 97198.54, Skewness: -0.14, Kurtosis: -1.49, R²: 0.8458)
The distribution is zero-inflated (a large peak at 0), while the non-zero portion is relatively spread out. The median (97.2k) is well above the mean (76.5k), indicating that the mass of zero-balance accounts is dragging the mean down. The low $R^2$ (0.8458) reflects a strong departure from normality.
- Insight: The large zero peak is critical: it may represent a substantial group of inactive or savings-only customers, with the non-zero group forming a distinct population. Modeling this as a single variable is sub-optimal.
- Action: Feature engineering is needed. Create a new binary variable, "HasZeroBalance", and then, if necessary, transform the non-zero portion of the Balance variable separately.
- EstimatedSalary (Mean: 100090.24, Median: 100193.91, Skewness: 0.00, Kurtosis: -1.18, R²: 0.9569)
The distribution is close to uniform across its range (from 11.58 to ≈200k), with zero skewness (0.00). The $R^2$ (0.9569) falls short of 1 because the Q-Q plot compares against a normal distribution, not a uniform one. There are no outliers.
- Insight: The bank’s customers span the full range of income levels, with no particular income bracket dominating the dataset.
- Action: Scaling (StandardScaler) is sufficient. Transformation is unnecessary given its uniformity.
- Point Earned (Mean: 606.52, Median: 605.0, Skewness: 0.01, Kurtosis: -1.19, R²: 0.9555)
The distribution is relatively uniform between 119 and 1000 and symmetric (Skewness: 0.01). The $R^2$ (0.9555) is consistent with a well-behaved, near-uniform spread with no outliers.
- Insight: Point accumulation is spread evenly across customers, with no extreme outliers.
- Action: Scaling (StandardScaler) is the appropriate step to normalize the magnitude without altering the shape.
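The skewness, kurtosis, and Q-Q plot $R^2$ values quoted above can be computed along the following lines. This is a sketch on synthetic samples, under the assumption that $R^2$ is the squared correlation of the normal probability plot; the notebook's `plot_distributions` helper is not shown in this export and may compute these statistics slightly differently (for example, with bias-corrected estimators).

```python
import numpy as np
from scipy import stats

def normality_summary(values):
    """Return skewness, excess kurtosis, and normal Q-Q plot R^2 for a sample."""
    values = np.asarray(values, dtype=float)
    # probplot returns ((theoretical, ordered), (slope, intercept, r)).
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return {
        "skewness": float(stats.skew(values)),
        "kurtosis": float(stats.kurtosis(values)),  # excess kurtosis (normal = 0)
        "qq_r2": float(r ** 2),
    }

# Synthetic samples (hypothetical): one bell-shaped, one flat.
rng = np.random.default_rng(0)
normal_like = normality_summary(rng.normal(650, 97, size=5000))
uniform_like = normality_summary(rng.uniform(0, 200_000, size=5000))

print("normal-like :", normal_like)
print("uniform-like:", uniform_like)
```

A flat sample shows near-zero skewness, excess kurtosis near -1.2, and a Q-Q $R^2$ noticeably below that of a bell-shaped sample, which matches the pattern described for Tenure, EstimatedSalary, and Point Earned.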
Balance Revisited¶
When only considering the non-zero balances, the distribution is highly symmetrical (Skewness: 0.03) and exhibits an excellent fit to a normal distribution. The high Q-Q plot $R^2$ of 0.9993 strongly confirms its near-perfect normality.
- Insight: This finding is crucial: the zero-inflated nature was the cause of non-normality in the original variable. By separating the two populations (zero balance vs. non-zero balance), the continuous part of the feature (Balance >0) is now exceptionally well-behaved (see below plots).
- Preprocessing Strategy and Rationale:
- Feature Engineering: A new binary feature, "HasZeroBalance", has been created to capture the effect of an inactive or savings-only account. This addresses the structural issue of zero-inflation.
- Multicollinearity Management: If both the original Balance (including zero balances) and the new "HasZeroBalance" feature are kept, the model would capture two distinct effects:
- The binary effect: Does having any balance matter?
- The continuous effect: Does the magnitude of the balance matter?
Keeping both introduces multicollinearity, which will be managed in the modeling phase using regularization techniques (e.g., Lasso or Ridge regression) rather than by discarding valuable information. Whether both variables make it into the final model is still under review.
- Action: Scaling (StandardScaler) of Balance will be suitable. Since the non-zero portion of the distribution is already virtually normal and lacks significant outliers altogether, scaling is preferred over a logarithmic transformation. A log transform would unduly compress the high balances, which, being part of a highly symmetrical distribution, are already statistically realistic and informative.
Case when zero balances have been removed from Balance:
Case when original Balance is log1p transformed:
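The "HasZeroBalance" indicator and the non-zero subset discussed above can be derived as in this minimal sketch. The balances are taken from the random sample shown earlier; the variable names other than `Balance` and `HasZeroBalance` are illustrative assumptions.

```python
import pandas as pd

# Balances from the five sampled rows displayed earlier in the notebook.
df = pd.DataFrame({"Balance": [171740.69, 0.00, 70438.01, 0.00, 146282.79]})

# 1 marks a zero-balance account, 0 marks an account holding money.
df["HasZeroBalance"] = (df["Balance"] == 0).astype(int)

# Continuous part of the feature, used to inspect the non-zero distribution.
non_zero_balance = df.loc[df["Balance"] > 0, "Balance"]

print(df)
print("Non-zero mean:", round(non_zero_balance.mean(), 2))
```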
Categorical Features Analysis¶
Geography:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| France | 5014 | 50.1 |
| Germany | 2509 | 25.1 |
| Spain | 2477 | 24.8 |

Gender:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| Male | 5457 | 54.6 |
| Female | 4543 | 45.4 |

NumOfProducts:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| 1 | 5084 | 50.8 |
| 2 | 4590 | 45.9 |
| 3 | 266 | 2.7 |
| 4 | 60 | 0.6 |

HasCrCard:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| 1 | 7055 | 70.6 |
| 0 | 2945 | 29.4 |

IsActiveMember:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| 1 | 5151 | 51.5 |
| 0 | 4849 | 48.5 |

Exited:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| 0 | 7962 | 79.6 |
| 1 | 2038 | 20.4 |

Complain:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| 0 | 7956 | 79.6 |
| 1 | 2044 | 20.4 |

Satisfaction Score:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| 3 | 2042 | 20.4 |
| 2 | 2014 | 20.1 |
| 4 | 2008 | 20.1 |
| 5 | 2004 | 20 |
| 1 | 1932 | 19.3 |

Card Type:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| DIAMOND | 2507 | 25.1 |
| GOLD | 2502 | 25 |
| SILVER | 2496 | 25 |
| PLATINUM | 2495 | 25 |

HasZeroBalance:
| Value | Count | Relative Frequency (%) |
|---|---|---|
| 0 | 6383 | 63.8 |
| 1 | 3617 | 36.2 |
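A count and relative-frequency table of this kind can be built from `value_counts`. The helper below is a hypothetical sketch (the notebook's `plot_categorical_features` function is not shown), and the series is reconstructed from the Geography counts reported above.

```python
import pandas as pd

def frequency_table(series):
    """Return counts and relative frequencies (%) for a categorical series."""
    counts = series.value_counts()  # sorted by count, descending
    return pd.DataFrame({
        "Value": counts.index,
        "Count": counts.to_numpy(),
        "Relative Frequency (%)": (counts.to_numpy() / counts.sum() * 100).round(1),
    })

# Series rebuilt from the Geography counts reported in this section.
geography = pd.Series(
    ["France"] * 5014 + ["Germany"] * 2509 + ["Spain"] * 2477, name="Geography"
)
geography_table = frequency_table(geography)
print(geography_table)
```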
Univariate Analysis Insights: Categorical Features¶
This section presents the univariate analysis of the Bank Customer Churn dataset's categorical features, revealing critical insights into customer demographics and engagement. Understanding these distributions is vital for shaping our modeling approach and preprocessing strategy.
Insights: A Narrative of Customer Profiles¶
The distribution of the categorical variables highlights specific characteristics that either indicate high churn risk or present modeling challenges due to data imbalance:
- Churn Rate (Exited) and Complaints:
The dataset shows a significant class imbalance, with 7,962 customers (79.6%) staying with the bank (0's) and 2,038 customers (20.4%) churning (1's). This low percentage of churners demands careful handling (like oversampling) in the modeling phase. Notably, the Complain feature has a nearly identical distribution, with 2,044 customers (20.4%) having complaints, which hints that a formal complaint may be closely tied to attrition.
- Product Engagement (NumOfProducts):
The vast majority of customers hold few products: 5,084 customers (50.8%) have one product and 4,590 (45.9%) have two. Very few use three (266) or four (60). Whether product count is a risk factor for churn cannot be read from this distribution alone and is examined in the bivariate analysis.
- Geographic and Gender Distribution:
- Geography: France accounts for half of the customer base with 5,014 customers (50.1%), while Germany (2,509, 25.1%) and Spain (2,477, 24.8%) are almost evenly split. Given this regional imbalance, Geography should be encoded and checked for an association with churn.
- Gender: The split is near-balanced, with Males (5,457, 54.6%) slightly outnumbering Females (4,543, 45.4%).
- Loyalty and Activity:
- Active Membership (IsActiveMember): The group is split almost perfectly: 5,151 customers (51.5%) are active versus 4,849 (48.5%) inactive. Inactivity is a clear area of interest for predicting flight risk.
- Credit Card Ownership (HasCrCard): A strong majority of customers, 7,055 (70.6%), hold a credit card, leaving non-holders (2,945, 29.4%) as a smaller segment whose loyalty is worth checking.
- Uniformly Distributed Features:
- Satisfaction Score: Scores are highly uniform across levels (e.g., Score 3: 2,042; Score 1: 1,932), indicating that no single rating dominates. This even distribution might reduce its individual predictive value.
- Card Type: All four types (DIAMOND: 2,507, GOLD: 2,502, SILVER: 2,496, PLATINUM: 2,495) are almost equally represented (~25% each), suggesting the specific card tier is unlikely to be an independent driver of churn.
Preprocessing Plan¶
Based on these findings, we will handle the categorical features as follows:
- Encoding Categorical Variables: Geography, Gender, and Card Type will be transformed into numerical features using one-hot encoding, as they lack any inherent ordinal relationship.
- Handling Ordinal Variables: Satisfaction Score and NumOfProducts inherently contain rank information (ordinality); however, they will also be transformed using one-hot encoding so that the effect of each level can be assessed separately.
- Binary Features (Ready to Use): The variables Exited, Complain, HasCrCard, IsActiveMember, and HasZeroBalance are already encoded as binary (0 or 1). No further preprocessing or encoding is required for these features.
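The encoding plan above maps directly onto `pandas.get_dummies`. This is a minimal sketch on a hypothetical four-row frame; whether the notebook uses `get_dummies` or a scikit-learn encoder is not shown, so treat the call as one reasonable option.

```python
import pandas as pd

# Hypothetical rows covering a few levels of each feature.
df = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Card Type": ["SILVER", "DIAMOND", "GOLD", "PLATINUM"],
    "Satisfaction Score": [4, 2, 5, 3],
    "NumOfProducts": [1, 2, 2, 3],
    "IsActiveMember": [0, 1, 1, 0],  # already binary, left untouched
})

# Nominal and ordinal features are both one-hot encoded, as planned above.
one_hot_cols = ["Geography", "Gender", "Card Type", "Satisfaction Score", "NumOfProducts"]
encoded = pd.get_dummies(df, columns=one_hot_cols, dtype=int)

print(encoded.columns.tolist())
```

For linear models, dropping one level per feature (`drop_first=True`) is a common way to avoid perfectly collinear indicator columns; that choice is left open here.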
Following the univariate insights, the next critical step is the Bivariate Analysis. This phase shifts focus to how individual features interact with our target variable, Exited (Churn). Our primary goal is to move beyond simple distributions and systematically identify the true correlations and dependencies that drive customer attrition. We will use visualization techniques, such as box plots, heatmaps, and contingency tables, alongside calculated correlation coefficients, to uncover the most predictive relationships. The findings here will directly inform our final feature selection and influence subsequent modeling decisions.
Correlation Analysis¶
To clearly identify the factors most strongly associated with customer churn, we segmented our correlation analysis into four strategic categories: Demographics, Financials, Engagement, and Feedback. This approach allows us to easily visualize how variables relate to our target outcome, Exited (Churn), as well as how variables within each group interact.
We used Spearman's Rank Correlation for this analysis because the underlying data distribution for most variables is non-normal. To include our categorical data, all such variables were converted into binary (0 or 1) indicators.
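The approach described here, indicator-encoding the categorical variables and then computing Spearman correlations, can be sketched as follows. The data are synthetic, with a built-in age effect chosen only for illustration, and the `corr_plot_func` helper used in the notebook is not reproduced.

```python
import numpy as np
import pandas as pd

# Synthetic demographics (hypothetical): exit probability rises with age.
rng = np.random.default_rng(42)
n = 500
age = rng.integers(18, 80, size=n)
exited = (rng.random(n) < (age - 18) / 100).astype(int)
geography = rng.choice(["France", "Germany", "Spain"], size=n)

df = pd.DataFrame({"Age": age, "Geography": geography, "Exited": exited})

# Convert categorical variables into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["Geography", "Exited"], dtype=int)

# Rank-based correlation, which does not assume normally distributed inputs.
corr = encoded.corr(method="spearman")

# Row of interest: association of each feature with churn (Exited_1).
exited_row = corr.loc["Exited_1"].drop(["Exited_0", "Exited_1"])
print(exited_row.round(2))
```

Because `Exited_0` and `Exited_1` are exact complements, their correlations mirror each other with opposite signs, which is why such pairs appear twice in a ranked correlation list.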
Our initial analysis focuses on the demographics group: Geography, Gender, and Age. Reviewing the correlation results (specifically the row for "Exited_1" [Churn]), we observe the following key insights regarding customer characteristics:
Age is the strongest demographic driver of churn with a correlation of 0.32. This suggests that older customers are significantly more likely to exit.
Geography is the second most correlated factor. Customers located in Germany show a higher association with exiting compared to customers in other regions within our dataset.
Gender also shows a measurable correlation. Specifically, the data indicates a positive correlation with the female segment and a negative correlation with the male segment. This suggests that female customers are disproportionately associated with the decision to churn in our sample.
This early segmentation helps us prioritize which customer profiles (older, female, German-based) require immediate attention and targeted retention strategies.
Next, we examined the customer financial data group, which includes Estimated Salary, Account Balance (continuous), Has Zero Balance (binary), and Credit Score.
Looking at the correlation results with the Exited (Churn) outcome, we find that the variables in this group generally show weak correlations with churn.
The highest, albeit still low, correlations are observed with Account Balance (continuous) and the Has Zero Balance indicator.
Balance as a continuous variable shows a positive correlation with churn, meaning customers with larger account balances are slightly more associated with exiting.
The binary variable Has Zero Balance shows a negative correlation. This is logical: since the binary variable is coded '1' for a zero balance, the negative correlation indicates that customers with zero balances are less associated with exiting compared to those who hold money with the bank.
Given the similar, small magnitude of correlation for these two Balance-related variables, they essentially provide redundant information regarding the likelihood of churn. Overall, the financial data, outside of Balance itself, is not a primary driver of churn in this model.
We next analyzed customer engagement data, including Tenure, Points Earned, Number of Products, Credit Card Ownership, Active Member Status, and Card Type.
Looking at the correlation results with Exited (Churn), the key drivers are Number of Products and Active Member Status.
Number of Products shows the most significant correlation within this group. Intriguingly, customers holding two products are more likely to be retained (less likely to exit) compared to those with one, three, or four products, all of which show a higher association with churn. This finding highlights a specific retention sweet spot at the two-product level and warrants deeper investigation.
Active Member Status shows the expected negative correlation with churn. This confirms that customers who are flagged as Active Members are less likely to exit compared to those who are inactive.
Other variables like Tenure, Points Earned, and Credit Card Ownership showed negligible correlation with the churn outcome. This suggests that retention efforts should heavily prioritize increasing active member rates and understanding the dynamics around product bundling.
Finally, we examined the customer feedback data, which includes a Complaint indicator (did the customer complain?) and their Satisfaction Score regarding complaint resolution.
Our analysis of the correlation with Exited (Churn) reveals a significant disparity:
We found no discernible correlation between a customer's Satisfaction Score regarding issue resolution and their decision to exit. This suggests that simply resolving a complaint may not prevent churn if the underlying issue is severe.
In stark contrast, we observe an extremely strong, almost perfect correlation between customers who complain and those who subsequently exit.
This near-perfect correlation between filing a complaint and churning is an extreme finding that clearly identifies complaints as a critical and immediate precursor to customer loss. This warrants immediate and deeper investigation.
After analyzing correlations within each segment, we ran all inter-group correlations and identified the top 30 strongest relationships (by absolute value) across the entire dataset.
The variables that demonstrate the highest and most frequent correlations across different business categories are: Geographical Location, Number of Products, Has Zero Balance, Complaint Indicator, Active Member Status, and Age.
This prioritized list of six variables will be critical. It confirms that the most influential variables regarding customer behavior are not isolated; they frequently interact across the Demographics, Financials, and Engagement categories. We will use these top cross-category correlations to guide the next phase of our analysis, focusing on how these relationships collectively drive the decision to exit.
| | Var1 | Var2 | correlation |
|---|---|---|---|
| 0 | Geography_Germany | HasZeroBalance_1 | -0.44 |
| 1 | Geography_Germany | HasZeroBalance_0 | 0.44 |
| 2 | HasZeroBalance_1 | NumOfProducts_2 | 0.39 |
| 3 | HasZeroBalance_0 | NumOfProducts_2 | -0.39 |
| 4 | HasZeroBalance_1 | NumOfProducts_1 | -0.39 |
| 5 | HasZeroBalance_0 | NumOfProducts_1 | 0.39 |
| 6 | Geography_Germany | Balance | 0.37 |
| 7 | Balance | NumOfProducts_2 | -0.34 |
| 8 | Balance | NumOfProducts_1 | 0.33 |
| 9 | Age | Complain_1 | 0.32 |
| 10 | Age | Complain_0 | -0.32 |
| 11 | NumOfProducts_2 | Complain_0 | 0.29 |
| 12 | NumOfProducts_2 | Complain_1 | -0.29 |
| 13 | NumOfProducts_3 | Complain_1 | 0.26 |
| 14 | NumOfProducts_3 | Complain_0 | -0.26 |
| 15 | Geography_France | HasZeroBalance_1 | 0.25 |
| 16 | Geography_France | HasZeroBalance_0 | -0.25 |
| 17 | Geography_France | Balance | -0.21 |
| 18 | NumOfProducts_1 | Complain_0 | -0.18 |
| 19 | NumOfProducts_1 | Complain_1 | 0.18 |
| 20 | Geography_Germany | Complain_1 | 0.18 |
| 21 | Geography_Germany | Complain_0 | -0.18 |
| 22 | IsActiveMember_1 | Complain_0 | 0.15 |
| 23 | IsActiveMember_0 | Complain_1 | 0.15 |
| 24 | IsActiveMember_1 | Complain_1 | -0.15 |
| 25 | IsActiveMember_0 | Complain_0 | -0.15 |
| 26 | NumOfProducts_4 | Complain_0 | -0.15 |
| 27 | NumOfProducts_4 | Complain_1 | 0.15 |
| 28 | Geography_Spain | HasZeroBalance_1 | 0.15 |
| 29 | Geography_Spain | HasZeroBalance_0 | -0.15 |
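A ranked list like the table above can be produced by keeping only variable pairs that belong to different groups and sorting by absolute correlation. The function, group labels, and toy data below are hypothetical; they sketch the idea and are not the notebook's implementation.

```python
import pandas as pd

def top_cross_group_correlations(corr, groups, top_n=30):
    """List the strongest correlations between variables of different groups.

    corr   : square correlation DataFrame
    groups : dict mapping each column name to a group label
    """
    cols = list(corr.columns)
    rows = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # each unordered pair once
            if groups[a] != groups[b]:  # skip within-group pairs
                rows.append((a, b, corr.loc[a, b]))
    pairs = pd.DataFrame(rows, columns=["Var1", "Var2", "correlation"])
    order = pairs["correlation"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top_n).reset_index(drop=True)

# Toy indicator data (hypothetical values).
toy = pd.DataFrame({
    "Geography_Germany": [1, 1, 1, 0, 0, 0, 0, 1],
    "HasZeroBalance_1":  [0, 0, 0, 1, 1, 1, 0, 1],
    "NumOfProducts_2":   [0, 0, 1, 1, 1, 1, 0, 0],
})
groups = {
    "Geography_Germany": "Demographics",
    "HasZeroBalance_1": "Financials",
    "NumOfProducts_2": "Engagement",
}
top_pairs = top_cross_group_correlations(toy.corr(method="spearman"), groups, top_n=30)
print(top_pairs)
```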
Effect of Complaints on Churn¶
The Complain variable has demonstrated a near-perfect predictive relationship with churn. Almost every customer who exited the bank ($\text{Exited}=1$) had also filed a complaint ($\text{Complain}=1$). Conversely, scarcely any non-complaining customers exited. This makes Complain an exceptionally powerful predictor.
While highly predictive, this near-perfect relationship introduces a technical problem known as quasi-complete separation. Under separation, the maximum-likelihood coefficient estimates in models like logistic regression (and their associated standard errors) become inflated, unstable, or mathematically unreliable.
Our primary objective is to predict churn among the existing customer base. Given this goal, the Complain variable is an indispensable, early indicator of risk. We will therefore retain this variable, treating customers who complain as a high-threat cohort on the verge of attrition.
- Note on Alternative Goals: If the modeling goal were to predict the churn propensity of a new customer (before any service interaction), we would be forced to remove Complain, as its value would be unknown at the time of prediction. Since we are modeling current customer risk, keeping the variable is justified.
To manage the statistical instability caused by the separation, we will apply regularization techniques (such as L1 or L2 penalties) to stabilize the model's coefficient estimates. Our current analytical focus shifts to the small, crucial group of 10 customers who complained but ultimately chose to stay and the 4 customers who churned but did not have a complaint. Identifying the unique characteristics and mitigating factors that prevented their churn is now our top priority, as these insights are invaluable for developing effective retention strategies.
Count of customers who complained but didn't churn: 10
Count of customers who churned but didn't complain: 4
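The two exception counts above come from cross-tabulating Complain against Exited. The sketch below shows one way to do that tally using only the standard library. The records are hypothetical toy data, not the Kaggle dataset, and the `crosstab` helper and the `agreement` measure are illustrative names that do not appear in the notebook.

```python
from collections import Counter

# Hypothetical toy records as (Complain, Exited) pairs; NOT the real dataset.
records = [(1, 1)] * 6 + [(0, 0)] * 12 + [(1, 0)] * 1 + [(0, 1)] * 1


def crosstab(pairs):
    """Count rows for every (Complain, Exited) combination, including empty cells."""
    counts = Counter(pairs)
    return {(c, e): counts.get((c, e), 0) for c in (0, 1) for e in (0, 1)}


table = crosstab(records)
complained_stayed = table[(1, 0)]   # complained but did not churn
churned_silent = table[(0, 1)]      # churned without complaining

# Share of rows where Complain equals Exited. Values close to 1 are the
# practical symptom of quasi-complete separation.
agreement = (table[(0, 0)] + table[(1, 1)]) / len(records)

print(f"Complained but stayed: {complained_stayed}")
print(f"Churned without complaining: {churned_silent}")
print(f"Agreement between Complain and Exited: {agreement:.3f}")
```

When the two off-diagonal cells are tiny relative to the diagonal, a single binary feature nearly determines the target, which is why regularization is needed to keep the logistic regression coefficient finite.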
Product Engagement Effect on Churn¶
The number of products a customer holds is a strong predictor of churn. The sweet spot for retention is two products ($\sim 6\%$ churn). Beyond this, customers become high risk: churn jumps to $\sim 81\%$ for three products, and four products result in near-total loss.
Action: This requires immediate investigation by a dedicated task force. The current pricing, bundling, or service model may be driving away the most engaged customers.
Chi-Square Statistic: 1501.505, p-value: 0.000
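The chi-square statistics reported in this and the following sections test whether churn is independent of a categorical feature. The notebook's own test call is not shown, so the following is a from-scratch sketch of Pearson's test (without any continuity correction), using only the standard library. The contingency table is hypothetical, and the function names `chi_square_independence` and `chi2_sf` are assumptions made for illustration.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi2_sf(x, dof):
    """Upper-tail probability of the chi-square distribution, i.e. Q(dof/2, x/2).

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1),
    starting from Q(1/2, z) = erfc(sqrt(z)) for odd dof
    or Q(1, z) = exp(-z) for even dof.
    """
    z = x / 2.0
    if dof % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < dof / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical counts: rows are NumOfProducts 1..4, columns are (stayed, churned).
toy_table = [[70, 30], [90, 10], [4, 16], [0, 5]]
stat, dof = chi_square_independence(toy_table)
p_value = chi2_sf(stat, dof)
print(f"Chi-Square Statistic: {stat:.3f}, dof: {dof}, p-value: {p_value:.3g}")
```

A p-value printed as 0.000 in the notebook output means the value is below the display precision, not that it is exactly zero.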
Geographical Effects on Churn¶
The association between geography and churn is statistically significant. France and Spain retained the highest proportion of customers (churn rates of about 17%), while Germany saw an absolute number of churned customers equal to France, representing a 1:3 churn-to-retention ratio.
Chi-Square Statistic: 300.626, p-value: 0.000
Gender Effects on Churn¶
A statistically significant disparity exists in churn rates between genders. Women churn at a rate of approximately 25%, which is substantially higher than the roughly 15% churn rate observed for men. This significant finding holds true even though the sample slightly favors men (5% more men than women).
Chi-Square Statistic: 112.397, p-value: 0.000
Active Members Effects on Churn¶
Customer activity has a statistically significant effect on churn: Inactive members churn at a rate of nearly 25%, which is dramatically higher than the $\approx 15\%$ rate seen among active members.
Chi-Square Statistic: 243.695, p-value: 0.000
Zero Balance Members Effects on Churn¶
A statistically significant disparity exists in churn based on account balance. Customers with a positive balance churn at a much higher rate (around 24%) than those with a zero balance (about 14%). This result is counter-intuitive, as it suggests that holding a balance is associated with a greater likelihood of exiting, despite the prior finding that inactive customers (who often have zero balances) are typically higher risk. Unfortunately we do not have the right data to dive deeper into this question.
Chi-Square Statistic: 149.484, p-value: 0.000
Effects of Age on Churn¶
The analysis of Age reveals a substantial difference in distributions between customers who exited (1) and those who did not (0), establishing age as a highly significant predictor of churn.
Specifically, the majority of retained (non-exited) customers are concentrated in a younger demographic, with their core age range lying between 30 and 41 years old. In sharp contrast, customers who have churned are notably older, with their core age range spanning 38 to 51 years. This upward shift in age for the Exited group strongly suggests that older customers are at a significantly higher risk of attrition.
The disparity in the median ages between the two cohorts is also statistically significant, further confirming that this relationship is not due to random chance and solidifying Age as a critical factor in predicting customer departure.
Mann-Whitney U Statistic: 4347741.000, p-value: 0.000
Effects of Overall Balance on Churn¶
An analysis of the balance distributions for customers with a positive balance (for example, by looking at histograms separated by churn status) reveals a key insight: the density distributions for customers who churned and those who did not are remarkably similar. This finding suggests that holding a very high positive balance, compared to a moderate positive balance, does not inherently change a customer's likelihood of leaving the service.
We must be careful when testing the statistical significance of the raw Balance variable (the continuous variable) across its entire range. The distribution of non-churned customers has a significantly higher proportion of zero-balance accounts than the distribution of churned customers. If we test for significance using the entire balance range (including zeros), we will get a statistically significant result. However, this significance is misleading; it's entirely driven by the mean rank difference between zero-balance and positive-balance accounts, not by the variation within the positive balances. To correctly isolate the effect of the continuous balance, we must test for significance only on the subset of customers with a positive balance. When we do this, the positive balance is not a statistically significant predictor of churn. The observed statistical difference when using the full variable is wholly attributable to the overwhelming disparity in the proportion of non-exited customers who maintain a zero balance.
Recommendation:
Because the predictive power is concentrated solely at the distinction between a zero balance and a positive balance, the continuous raw Balance variable is not a useful predictor on its own. The binary feature HasZeroBalance (a value of 1 for zero balance, 0 for positive balance) captures all the necessary information more effectively and parsimoniously. We should rely on the binary HasZeroBalance feature and potentially exclude the raw continuous Balance variable from the final predictive model. This simplifies the model without sacrificing predictive accuracy.
Mann-Whitney U Statistic including zero balance: 6852646.500, p-value: 0.000
Mann-Whitney U Statistic on non-zero balance: 3650896.500, p-value: 0.234
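The contrast between the two Mann-Whitney results can be reproduced on a small scale. The sketch below computes the U statistic directly from its pairwise definition and compares the full balance range with the positive-only subset. The balances are hypothetical toy values chosen so that the only difference between groups is the share of zero balances; `mann_whitney_u`, `positive`, and `effect` are illustrative helper names, and no p-value is computed.

```python
def mann_whitney_u(x, y):
    """U statistic for x versus y: each pair with x > y counts 1, each tie counts 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def positive(values):
    """Keep only strictly positive balances."""
    return [v for v in values if v > 0]


def effect(x, y):
    """U divided by the number of pairs; 0.5 means no shift between the groups."""
    return mann_whitney_u(x, y) / (len(x) * len(y))


# Hypothetical balances, NOT the real data.
# Half of the retained customers hold a zero balance, a quarter of the churned do.
stayed = [0.0, 0.0, 0.0, 0.0, 90_000.0, 110_000.0, 130_000.0, 150_000.0]
churned = [0.0, 100_000.0, 120_000.0, 140_000.0]

full = effect(churned, stayed)
pos_only = effect(positive(churned), positive(stayed))

print(f"Effect including zero balances: {full:.3f}")
print(f"Effect on positive balances only: {pos_only:.3f}")
```

In this toy setup the shift disappears once zero balances are excluded, which mirrors the argument for replacing the continuous Balance with the binary HasZeroBalance flag.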
Linking Features to Churn Summary¶
Our initial screening, based on separation by churn status, has identified a robust set of statistically significant predictors of customer attrition. These key drivers include the demographic variables Age and Gender, the behavioral indicators Number of Products, Activeness Status, and whether a Complaint was filed, along with the Geographical location.
Crucially, our feature engineering effort established that the HasZeroBalance binary variable is analytically superior to the raw, continuous Balance variable. The binary feature will be retained due to its clarity and greater predictive efficacy.
Path Forward: Multivariate Analysis
The next phase of this study will transition to multivariate analysis. We will focus on exploring the complex relationships among these identified predictor variables. Specifically, we will systematically investigate potential interaction effects, test for mediation (where one variable's effect is explained by a second), and identify any confounding variables that may distort the true relationships between our key predictors and customer churn. This step is essential for building a final predictive model that is both accurate and rigorously interpreted.
Product Engagement Effects on Zero Balance and Churn¶
The analysis confirms a statistically significant interaction effect involving the zero balance status. The effect of a Zero Balance on churn is not uniform; it changes significantly depending on the number of products a customer holds.
Recommendation: We must include the explicit interaction term ($\text{HasZeroBalance} \times \text{Number of Products}$) in the model to accurately capture how these two factors combine to influence churn.
CMH equivalent test (Conditional Independence) Statistic: 8.0775, p-value: 0.004
Breslow-Day equivalent test (Homogeneity) Statistic: 240.5787, p-value: 0.000
| | Stratum | comparison | p_value_raw | corrected_p_value | significant |
|---|---|---|---|---|---|
| 0 | 1 | 1 vs 0 | 0.00 | 0.00 | True |
| 1 | 3 | 1 vs 0 | 0.00 | 0.00 | True |
| 2 | 2 | 1 vs 0 | 0.00 | 0.00 | True |
| 3 | 4 | 1 vs 0 (Insufficient Data) | 1.00 | 1.00 | False |
Customer Active Status on Age and Churn¶
We analyzed the relationship between Age, Churn (Exited), and Active Status (IsActiveMember) to refine the predictive models. Both Age and Active Status are individually significant predictors of Churn. We used a robust, nonparametric Aligned Rank Transform ($\text{ART}$) $\text{ANOVA}$, with Age as the dependent variable, to test the joint effect of the two categorical factors. The $\text{Churn} \times \text{IsActiveMember}$ interaction effect on the Age distribution was highly significant, meaning that the age profile differs markedly across the combinations of active/inactive and churned/not churned. Post-hoc testing revealed differences in age distribution across almost all combinations of Churn and Active Status. The only exception was the comparison between inactive members who churned ($\text{Exit}=1, \text{IsActive}=0$) and active members who churned ($\text{Exit}=1, \text{IsActive}=1$), whose age distributions were statistically similar ($p = 0.65$). This pattern indicates that Active Status doesn't just predict churn; it moderates how Age is linked to churn risk.
Recommendation: Immediately add an interaction variable ($\text{Age} \times \text{IsActiveMember}$) to predictive churn models. This step is essential to capture the full moderating effect and prevent misestimating the churn risk associated with age for different member segments.
Mann-Whitney U Statistic between Age and IsActiveMember: 11914173.000, p-value: 0.000
--------------------------------------------------
Aligned Rank Transformation Omnibus Test for Effects of Exit and IsActiveMember on Age:
sum_sq df F PR(>F)
Main effect: Exited 4218716618.88 1.00 568.87 0.00
Main effect: IsActiveMember 13257084.85 1.00 1.59 0.21
Interaction: Exited:IsActiveMember 193203740.46 1.00 23.29 0.00
--------------------------------------------------
Aligned Rank Transformation Contrast Post-Hoc Test for Interaction (Exit:IsActiveMember):
0:0 0:1 1:0 1:1
0:0 1.00 0.00 0.00 0.00
0:1 0.00 1.00 0.00 0.00
1:0 0.00 0.00 1.00 0.65
1:1 0.00 0.00 0.65 1.00
Conditional independence between age and active status on exited statistic: 911.889, p-value: 0.000
--------------------------------------------------
Homogeneity Test between age and active status on exited statistic: 340.270, p-value: 0.000
Product Engagement and Age on Churn¶
The data reveals a clear relationship between average customer age and the number of products they hold, particularly concerning churn risk. Customers with only one product are, on average, the oldest across both the churned and non-churned groups. A notable interaction exists between age and product count: two-product customers who churned were older than those with three products who churned. An ART ANOVA with Age as the dependent variable confirmed a significant interaction between churn status (Exited) and the number of products a customer holds.
Recommendation: Include the Age $\times$ Number of Products interaction term in all predictive churn models to improve accuracy and targeting.
Kruskal H-Statistic between Age and Number of Products: 189.777, p-value: 0.000
Dunn's test pairwise interaction comparisons for Age grouped by Number of Products:
1 2 3 4
1 1.00 0.00 0.00 0.00
2 0.00 1.00 0.00 0.00
3 0.00 0.00 1.00 0.25
4 0.00 0.00 0.25 1.00
--------------------------------------------------
Aligned Rank Transformation Omnibus Test for Effects of Exit and Number of Products on Age:
sum_sq df F PR(>F)
Main effect: Exited 5736627826.87 1.00 768.74 0.00
Main effect: NumOfProducts 1048307984.03 3.00 43.00 0.00
Interaction: Exited:NumOfProducts 175801943.35 3.00 7.12 0.00
--------------------------------------------------
Aligned Rank Transformation Contrast Post-Hoc Test for Interaction (Exit:NumOfProducts):
0:1 0:2 0:3 1:1 1:2 1:3 1:4
0:1 1.00 0.78 1.00 0.00 0.00 0.00 0.00
0:2 0.78 1.00 1.00 0.00 0.00 0.00 0.00
0:3 1.00 1.00 1.00 0.00 0.00 0.00 0.00
1:1 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1:2 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1:3 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1:4 0.00 0.00 0.00 1.00 1.00 1.00 1.00
Data Anomalies¶
Satisfaction Scores Well Distributed Among Complaining Customers¶
The bulk of customer churn originated from the population of customers who lodged a complaint. Analysis of complaint resolution satisfaction scores shows no significant difference between complaining and non-complaining groups. Expected concentration of low scores (1-3) among complainants was not observed. The complaint satisfaction scoring system is ineffective at providing actionable insight into churn propensity. It requires review and redesign to establish a reliable metric for identifying high-risk customers.
Chi-Square Statistic: 3.011, p-value: 0.556
Credit Card Points Earned for Non-Credit Card Owners¶
The "Points Earned" variable is misleading. Despite documentation linking points exclusively to credit card usage, all customers, including non-cardholders, have points, and the point spread is statistically identical across all key customer segments (card ownership, card type, tenure, active status). The data source should be verified and the variable definition corrected, because the current data does not reflect the expected disparities based on loyalty or premium card usage and is therefore useless for churn segmentation.
Multivariate Analysis Conclusion¶
We are optimizing the churn prediction model by introducing three new interaction variables:
- Products $\times$ Zero Balance
- Active Member $\times$ Age
- Products $\times$ Age
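The three interaction families listed above can be built as simple products of already-encoded columns. The sketch below assumes one-hot columns with NumOfProducts_1 as the dropped reference level and uses the feature labels that appear in this notebook's coefficient table (for example `NumOfProd2_X_ZeroBal` and `IsActive_X_Age`). The `add_interactions` helper and the example customer are hypothetical.

```python
def add_interactions(row):
    """Return a copy of an encoded row with the three interaction families added."""
    out = dict(row)
    for k in (2, 3, 4):  # NumOfProducts_1 is assumed to be the reference level
        has_k = row[f"NumOfProducts_{k}"]
        out[f"NumOfProd{k}_X_ZeroBal"] = has_k * row["HasZeroBalance_1"]
        out[f"NumOfProd{k}_X_Age"] = has_k * row["Age"]
    out["IsActive_X_Age"] = row["IsActiveMember_1"] * row["Age"]
    return out


# Hypothetical customer: scaled Age, two products, zero balance, active member.
customer = {
    "Age": 0.8,
    "NumOfProducts_2": 1,
    "NumOfProducts_3": 0,
    "NumOfProducts_4": 0,
    "HasZeroBalance_1": 1,
    "IsActiveMember_1": 1,
}
features = add_interactions(customer)
for name, value in features.items():
    print(f"{name}: {value}")
```

Building the products after scaling and encoding keeps each interaction on the same scale as its parent columns, which matters for a penalized linear model.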
To maintain focus and efficiency, we are dropping six non-predictive variables (Credit Score, Estimated Salary, Tenure, Has Credit Card, Points Earned, and Satisfaction Score) from the prediction model. While these variables don't predict churn, they are valuable for future effect analysis on prevention strategies.
Our next critical step is to consolidate and prepare the final dataset before building our predictive model. We will follow a standard, multi-stage pipeline to ensure the resulting model is both highly accurate and robust across all customer types.
We will begin by splitting our data into training and testing sets. Due to the strategic importance of customer feedback, we are taking an extra step to stratify this split, ensuring both our training and final testing environments contain a representative sample of all key customer segments: those who complain and those who churn. This prevents bias and ensures a reliable final test. We will use cross-validation (multiple tests) instead of a single validation set to further guarantee the model's consistent performance.
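Stratifying on both complaint and churn status means every (Complain, Exited) combination keeps roughly the same share in the training and test sets. The following is a minimal standard-library sketch of that idea, not the notebook's actual split. The rows are hypothetical, and `stratified_split` with its `test_fraction` and `seed` parameters is an illustrative helper. In scikit-learn the same idea is usually expressed by passing a combined label to the `stratify` argument of `train_test_split`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, key, test_fraction=0.2, seed=42):
    """Split rows so each stratum keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def stratum(row):
    """Combined stratification label: (Complain, Exited)."""
    return (row["Complain"], row["Exited"])


# Hypothetical rows, NOT the real dataset.
rows = (
    [{"Complain": 1, "Exited": 1}] * 20
    + [{"Complain": 0, "Exited": 0}] * 70
    + [{"Complain": 1, "Exited": 0}] * 5
    + [{"Complain": 0, "Exited": 1}] * 5
)
train, test = stratified_split(rows, key=stratum)

print("train:", Counter(stratum(r) for r in train))
print("test: ", Counter(stratum(r) for r in test))
```

Without stratification, the rare groups (complained but stayed, churned without complaining) could end up entirely in one of the two sets, which would make the test set unrepresentative.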
Based on our analysis, we are using seven key variables (Age, Geography, Gender, Active Member status, Complaint status, Zero Balance status, and Number of Products) to predict customer churn (Exited).
The Data Preparation Pipeline:
We will execute a three-phase preparation process:
Standardization and Outlier Cleanup:
We will clean up extreme outliers and ensure all numerical data is on the same scale, which is essential for the model to fairly weigh the impact of each factor (e.g., Age versus Number of Products).
Risk Balancing & New Insight Creation:
To combat the issue of our model only seeing a small percentage of customers who actually churn, we are using an advanced technique (SMOTENC) to rebalance the dataset. This ensures the model is highly effective at identifying the small but critical population of high-risk customers.
We will create new high-impact interaction terms (e.g., the combined risk of 'Number of Products' and 'Zero Balance') that proved powerful in our initial analysis.
Final Modeling Feed:
We will complete the data encoding and feed the highly prepared, balanced dataset into the classification algorithm to begin model building in the next section.
These steps ensure we deliver a model that provides actionable, reliable predictions for the business.
Specific Pipeline Steps:
- Winsorize
- Transform
- Scale
- SMOTENC
- One-hot-encoding
- Create interaction terms
- Classifier
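The Winsorize and Scale steps at the top of this list can be sketched in a few lines. The notebook does not state its clipping limits, so the 10th and 90th percentile bounds below are assumptions, as are the helper names `winsorize` and `standardize` and the toy ages. The quantile lookup is a simple nearest-lower-rank rule rather than an interpolated quantile.

```python
import statistics


def winsorize(values, lower=0.10, upper=0.90):
    """Clip values to the empirical lower and upper quantiles (nearest lower rank)."""
    ordered = sorted(values)
    lo = ordered[int(lower * (len(ordered) - 1))]
    hi = ordered[int(upper * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical ages with one extreme value, NOT the real data.
ages = [18, 25, 30, 35, 40, 45, 50, 55, 60, 92]
clipped = winsorize(ages)
scaled = standardize(clipped)

print("clipped:", clipped)
print("scaled: ", [round(v, 2) for v in scaled])
```

In a real pipeline the clipping bounds, mean, and standard deviation would be learned on the training folds only and then applied to validation and test data, so that no information leaks from the held-out rows.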
Let's take a look at what our correlation structure looks like now with our training set, before and after SMOTENC.
-----------Standard correlation plot of transformed data------------- X_Shape: (8000, 17) y_Shape: (8000, 1)
-----------Smotenc correlation plot of transformed data------------- X_Shape: (12740, 17) y_Shape: (12740, 1)
We reviewed five modeling techniques, ranging from Logistic Regression to more complex methods such as Random Forest and XGBoost, to develop the most reliable and efficient predictor of customer churn.
Core Findings & Model Selection
Exceptional Performance: After rigorous testing and tuning, all five candidate models delivered identical cross-validated performance (best score 0.9986), and the selected model reached an accuracy of 0.9985 on held-out test data, which is above 99% on new, unseen customers.
Simplicity Wins: Given the near-identical high performance across the board, we selected the simplest, most streamlined model: Logistic Regression. This choice minimizes implementation cost, maximizes speed, and makes the model easy to explain and maintain without sacrificing predictive power.
The Key Driver: Our core finding is that the final model relies almost exclusively on one critical variable: whether or not a customer registers a formal complaint.
Strategic Gaps and Future Focus
While 99% accuracy is an outstanding statistical result, focusing on the single "Complain" variable reveals two important strategic gaps that we are currently addressing:
Silent Churners (Missed Opportunities): The current model cannot effectively predict customers who leave without ever complaining (customers who "ghost" us). These represent silent attrition that we need new signals to catch.
False Alarms (Wasted Resources): The model also struggles with customers who complain but ultimately decide to stay. If we focus intervention on every complaint, we risk over-allocating resources to customers who were never truly at risk of leaving.
In short: our model successfully confirmed that a complaint is a very strong signal for impending churn, but it doesn't solve the problem of unannounced churn.
Strategic Value and Next Steps
The model successfully answers the core problem: providing a reliable predictor of customer exit behavior based on a simple, actionable signal.
Our deep-dive analysis also revealed valuable strategic insights for the business:
Data Quality: We identified potential holes in our data collection that, if addressed, could unlock even stronger predictive drivers.
Product Strategy: The analysis hints at underlying issues related to product tiering or pricing structures that may be driving the complaint/churn cycle.
Our next step is to build an alternative predictive model that deliberately omits the "Complain" variable to uncover these other, less obvious drivers of churn and provide a more diverse set of early warning signals.
Model: LogisticRegression
Best Score: 0.99862500
Best Parameters: {'classifier__C': 0.0004, 'classifier__max_iter': 50, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}
------------------------------
Model: SVC
Best Score: 0.99862500
Best Parameters: {'classifier__C': 0.1, 'classifier__max_iter': 100}
------------------------------
Model: RandomForest
Best Score: 0.99862500
Best Parameters: {'classifier__max_depth': None, 'classifier__max_features': 'sqrt', 'classifier__min_samples_split': 4, 'classifier__n_estimators': 100}
------------------------------
Model: AdaBoost
Best Score: 0.99862500
Best Parameters: {'classifier__learning_rate': 0.03, 'classifier__n_estimators': 50}
------------------------------
Model: XGBoost
Best Score: 0.99862500
Best Parameters: {'classifier__alpha': 50, 'classifier__colsample_bytree': 0.3, 'classifier__lambda': 50, 'classifier__learning_rate': 0.03, 'classifier__max_depth': 2, 'classifier__min_child_weight': 5, 'classifier__n_estimators': 50, 'classifier__subsample': 1.0}
------------------------------
Best model is LogisticRegression with Test Accuracy: 0.9985
Average CV metrics summary:
| | model | Accuracy | Precision | Recall | F1 | F1_std |
|---|---|---|---|---|---|---|
| 0 | LogisticRegression | 0.9986 | 0.9951 | 0.9982 | 0.9966 | 0.0018 |
| 1 | SVC | 0.9986 | 0.9951 | 0.9982 | 0.9966 | 0.0018 |
| 2 | RandomForest | 0.9986 | 0.9951 | 0.9982 | 0.9966 | 0.0018 |
| 3 | AdaBoost | 0.9986 | 0.9951 | 0.9982 | 0.9966 | 0.0018 |
| 4 | XGBoost | 0.9986 | 0.9951 | 0.9982 | 0.9966 | 0.0018 |
The final estimator in the pipeline does not have 'feature_importances_'.
Feature Coefficients:
| | feature | importance |
|---|---|---|
| 5 | Complain_1 | 0.4330 |
| 0 | Age | 0.0000 |
| 10 | NumOfProd2_X_ZeroBal | 0.0000 |
| 16 | NumOfProd4_X_Age | 0.0000 |
| 15 | NumOfProd3_X_Age | 0.0000 |
| 14 | NumOfProd2_X_Age | 0.0000 |
| 13 | IsActive_X_Age | 0.0000 |
| 12 | NumOfProd4_X_ZeroBal | 0.0000 |
| 11 | NumOfProd3_X_ZeroBal | 0.0000 |
| 9 | NumOfProducts_4 | 0.0000 |
| 1 | Geography_Germany | 0.0000 |
| 8 | NumOfProducts_3 | 0.0000 |
| 7 | NumOfProducts_2 | 0.0000 |
| 6 | HasZeroBalance_1 | 0.0000 |
| 4 | IsActiveMember_1 | 0.0000 |
| 3 | Gender_Male | 0.0000 |
| 2 | Geography_Spain | 0.0000 |
| 17 | intercept | 0.0000 |
Average CV metrics summary:
| | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|
| Complain | 0.3222 | 0.0076 | 0.7879 | 0.0187 | 0.7898 | 0.0187 | 0.7888 | 0.0187 |
| CreditScore | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Geography | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Gender | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Age | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Tenure | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Balance | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| NumOfProducts | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| HasCrCard | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| IsActiveMember | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| EstimatedSalary | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Satisfaction Score | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Card Type | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Point Earned | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| HasZeroBalance | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
LogisticRegression Classification Report on Test Set:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.999 | 0.999 | 0.999 | 1592.000 |
| 1 | 0.995 | 0.998 | 0.996 | 408.000 |
| accuracy | 0.999 | 0.999 | 0.999 | 0.999 |
| macro avg | 0.997 | 0.998 | 0.998 | 2000.000 |
| weighted avg | 0.999 | 0.999 | 0.999 | 2000.000 |
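The figures in the classification report follow directly from four confusion-matrix counts. The notebook does not print its confusion matrix, so the counts below are an inference: they are one set of integers consistent with the reported supports (1592 and 408), the rounded class-1 precision and recall, and the test accuracy of 0.9985. Treat them as an assumption used for a worked example, not as notebook output.

```python
# Inferred counts for the churn class (Exited = 1); an assumption, not notebook output.
tp, fp, fn, tn = 407, 2, 1, 1590

precision = tp / (tp + fp)   # share of predicted churners who actually churned
recall = tp / (tp + fn)      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1-score:  {f1:.3f}")
print(f"accuracy:  {accuracy:.4f}")
```

Under these counts only three of the 2000 test customers are misclassified, which is what one would expect from a model that effectively predicts churn from the Complain flag alone.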
Time to save our model!
['bank_churn_pipeline.pkl']
If you would like to see my other works, you can visit:
My website
My GitHub
My Kaggle
Direct link to this file repository on my Github: Bank Churn Prediction.