This assignment demonstrates predictive segmentation of bank customers: an analysis that predicts which segment a person will belong to.
The prediction relies on predictors drawn from the historical data that the company has. These predictors are the independent variables used to compute the segment, which is the dependent variable in our analysis.
In the rest of this document, we describe how we carried out this exercise on the HDFC Bank data that was given to us.
Step 1 – Identifying the number of segments
The historical company data provided to us consisted of customers' responses to a questionnaire. Respondents recorded each response on a Likert scale where:
1 – Strongly Agree
2 – Agree
3 – Neither Agree nor Disagree
4 – Disagree
5 – Strongly Disagree
The image presents a glimpse of the dataset.
To identify the number of segments, we ran a Hierarchical Cluster analysis. In SPSS, this can be found in Analyze >> Classify >> Hierarchical Cluster.
The output includes a table called the Agglomeration Schedule, which shows how the respondents came together, or ‘agglomerated’, to form clusters. Here is a snippet of the agglomeration schedule we obtained.
From the agglomeration schedule, we could see a big jump in the coefficients between the 14th and the 15th stage. We therefore concluded that an effective segmentation had been achieved by the 14th stage. Each stage merges two clusters into one, so with 20 respondents, 14 stages leave 20 - 14 = 6 identifiable clusters.
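The same reading of the agglomeration schedule can be sketched outside SPSS. The snippet below is a minimal illustration, not the analysis we actually ran: the toy Likert scores are made up, the function name is hypothetical, and average linkage on squared Euclidean distances is an assumption, since the report does not state which linkage method was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def clusters_from_largest_jump(data, method="average", metric="sqeuclidean"):
    """Pick the number of clusters from the largest jump in merge coefficients."""
    schedule = linkage(data, method=method, metric=metric)  # one row per stage
    coefficients = schedule[:, 2]              # merge coefficient at each stage
    jumps = np.diff(coefficients)              # increase from stage i to stage i + 1
    last_good_stage = int(np.argmax(jumps)) + 1    # 1-based stage before the jump
    n_clusters = data.shape[0] - last_good_stage   # each stage removes one cluster
    labels = fcluster(schedule, t=n_clusters, criterion="maxclust")
    return n_clusters, labels

# Toy data (hypothetical): 9 respondents, 3 Likert items, 3 obvious groups.
toy = np.array([
    [5, 1, 1], [5, 2, 1], [4, 1, 1],
    [1, 5, 1], [1, 5, 2], [1, 4, 1],
    [1, 1, 5], [2, 1, 5], [1, 1, 4],
], dtype=float)

n_clusters, labels = clusters_from_largest_jump(toy)
print(n_clusters, labels)
```

In our data the largest jump came after stage 14 of 19, which is how the same arithmetic gives 6 clusters.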
The respondents that belonged to each cluster are as given below:
Step 2 – Cluster Profiling
Next, we needed to identify the predictors that differentiate the clusters, and then profile the clusters based on their behavioral responses. This was achieved by K-Means Clustering.
In SPSS, this can be found in Analyze >> Classify >> K-Means Cluster
Here, we have to enter the number of clusters required. The Hierarchical clustering above gave us 6 clusters, so that is the value we entered.
The output produces an ANOVA table. The last column of this table gives the Significance value, which helps us determine which predictors are important in differentiating between the clusters.
We chose a cut-off significance value of 0.01, which corresponds to a 99% confidence level. On that basis, we looked for the predictors whose significance is less than 0.01. These predictors are the significant ones for identifying what differentiates the segments.
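As a rough sketch of this screening step, the snippet below runs K-Means and then a one-way ANOVA per variable, keeping those with significance below 0.01. It is an illustration under assumptions, not the SPSS procedure itself: the data and the variable names Q1 to Q4 are made up, `significant_predictors` is a hypothetical helper, and the initial cluster centres are picked by hand so that the toy example is reproducible.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.stats import f_oneway

def significant_predictors(data, labels, names, alpha=0.01):
    """Return {name: p-value} for variables whose cluster means differ at level alpha."""
    significant = {}
    for j, name in enumerate(names):
        groups = [data[labels == c, j] for c in np.unique(labels)]
        p_value = f_oneway(*groups).pvalue     # one-way ANOVA across clusters
        if p_value < alpha:
            significant[name] = p_value
    return significant

# Toy data (hypothetical): 9 respondents, 4 Likert items.
# Q4 has the same spread in every group, so it should not separate the clusters.
data = np.array([
    [5, 1, 1, 2], [5, 2, 1, 3], [4, 1, 1, 4],
    [1, 5, 1, 2], [1, 5, 2, 3], [1, 4, 1, 4],
    [1, 1, 5, 2], [2, 1, 5, 3], [1, 1, 4, 4],
], dtype=float)
names = ["Q1", "Q2", "Q3", "Q4"]

# Assumption: start from one respondent per expected group (rows 0, 3 and 6).
centres, labels = kmeans2(data, data[[0, 3, 6]], minit="matrix")
result = significant_predictors(data, labels, names, alpha=0.01)
print(sorted(result))
```

One caveat worth keeping in mind: the clusters are built to be as different as possible, so these F-tests are best read as a descriptive ranking of variables rather than as formal hypothesis tests.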
Following is the glimpse of that ANOVA table.
From there, we picked out the most relevant variables on which the clusters can be differentiated. As you can see in the table below, those variables are marked green, and the Final Clusters table has been copied into Excel.
In the sheet Segment Profiling of the Excel file attached with this document, you can see the inferences we have written for each cluster on these variables, which are extremely important in determining the clusters.
A sample of those inferences is given in the image below.
In that image, you can see that for Cluster 1 we have written our inferences by interpreting its scores. These help us determine the choices and preferences of this cluster, which in turn helps us name the clusters in the next step.
After doing this for each cluster, we gave every cluster a name. The names are given in the sheet titled Segment Label in our attached Excel file.
We named our segments as follows:
Cluster 1 – Savers, stable and independent
Cluster 2 – Carefree, Spenders
Cluster 3 – Misers, Defaulters, Borrowers
Cluster 4 – Self-sufficient strong savers
Cluster 5 – Mild Savers
Cluster 6 – Daily wager, dependent, poor
We consider this segmentation valid because it satisfies the conditions on Wilks' Lambda and the eigenvalues.
Step 3 – Predictive Segmentation
Refer to the sheet titled Predictive Model in our Excel file. There you can see the unstandardized Fisher’s coefficients that make up our model. These are available in B4:G18.
Below that table, we have transposed all 20 individual responses that were given in the data. The segment to which each response belongs is also written there.
In column I, in cells I4:I17, we entered one of the 20 responses. Any new response can be pasted into these cells to check which segment it belongs to.
To its right, we multiplied each row of B4:G17 by the corresponding value in I4:I17, which gives one column of products per segment based on our Fisher’s coefficients.
Each column from J to O is then summed.
The predicted segment is the one whose column sum is the largest: the response in column I belongs to that segment.
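The spreadsheet logic above can be sketched in a few lines. The coefficients and the response below are made-up numbers for a tiny case with 3 predictors and 2 segments, and `predict_segment` is a hypothetical helper. Fisher's classification functions normally also carry a constant per segment; the sketch accepts one as an optional argument, which is an assumption about the layout of the sheet.

```python
import numpy as np

def predict_segment(coefficients, response, constants=None):
    """coefficients: (n_predictors, n_segments); response: (n_predictors,)."""
    products = coefficients * response[:, None]   # element-wise table (columns J to O)
    scores = products.sum(axis=0)                 # one column sum per segment
    if constants is not None:
        scores = scores + constants               # optional constant per segment
    return int(np.argmax(scores)) + 1, scores     # segments are numbered from 1

# Hypothetical coefficients: rows are predictors, columns are segments.
coefficients = np.array([
    [1.0, 0.2],
    [0.5, 1.5],
    [0.1, 0.3],
])
response = np.array([1.0, 5.0, 3.0])   # one respondent's Likert answers

segment, scores = predict_segment(coefficients, response)
print(segment, scores)
```

Here the second column sum is the larger one, so the response is assigned to segment 2.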
Step 4 – Cross-checking
We can easily check whether the segment that our model predicts is correct by comparing it with the segment already recorded for that response. As we can see, the model's prediction is correct.
Individual Tasks undertaken:
The entire group worked together to make the project a success. However, in different parts some of the members worked more than the others. The data analysis part was covered by Dhiraj and Darpan. A detailed analysis was brought to life by all the members of the group. The report writing was done by Ashish, Shahroz, Darpan, Rishabh and Tanay with reviews done by other members of the group.
The Predictive Analytics course has taught us various techniques for creating models that can predict a particular outcome. This includes creating supervised learning models, which are trained on historical data in which the outcome is already known. Customer churn prediction for a telecom company is one such application.
In this project, we take up a data set containing 3333 observations of customer churn data of a telecom company. Customer churn means the customer has left the services of this particular telecom company.
The dataset contains 11 variables associated with each of the 3333 observations. One of these 11 variables is Churn itself. So, if the customer has churned, that particular observation will have a churn value of 1, otherwise, it will be 0.
Since this is the variable that our model needs to predict, it is called the dependent variable. But the question is – what is it dependent on? Or, how will the system predict churn?
The answer is that churn will be predicted based on certain other variables, the 10 other variables. These are called independent variables. The image given below will give you an idea of all the 11 variables.
The dataset has 3333 observations and 11 variables. On inspecting it, we observed a good mix of churn = 0 and churn = 1 observations. We therefore treated the dataset as balanced and did not apply any undersampling or oversampling.
It is clear that our dependent variable is categorical (a value of either 0 or 1), whereas the independent variables are either categorical or numerical. Since the dependent variable is categorical, this is a clear case of a classification problem.
In such a scenario, Logistic Regression came across as the best model building solution.
Further, the following were the steps that we took:
- Since the dependent variable is a categorical variable, we had to convert it into a factor. A factor defines the levels of a categorical variable, and the system then treats the outputs as those levels.
- Next, the data had to be split into a Train dataset and a Test dataset. This was achieved using the sample.split function and then taking the Train and Test subsets from the full dataset.
- The split has been made such that 70% of the dataset is the Train dataset whereas the remaining 30% is the test dataset.
- Further, the model is developed on the basis of the Train dataset. Basically, the Train dataset is used to help the model learn about the interaction between the dependent and independent variable.
- Now, the summary of the model is observed. This is done to check a few things, one of which is the significance of the various independent variables.
- Further, it is observed in the summary that only 3 out of the 10 independent variables are significant. This means only these 3 variables have a significant impact on the customer churn.
- Therefore, we set aside the first model and create a new model with just these 3 variables as independent variables: ContractRenewal, CustServCalls and RoamMins.
- This final model is then applied to our Test dataset to check the accuracy and other parameters of the model that we developed.
- The output gives us probabilities. To predict whether there is churn or not, we need to set a threshold. We set the threshold at 0.5: if the predicted probability is above 0.5, the observation is classified as churn; otherwise it is not.
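The steps above can be sketched end to end as follows. This is a Python illustration under stated assumptions, not the R code used in the project: the data is synthetic, the logistic model is fitted with a hand-written Newton (IRLS) loop instead of a library routine, significance is judged with Wald p-values at a hypothetical 0.05 cut-off, and all function names are made up for this example.

```python
import math
import numpy as np

def stratified_split(y, train_fraction=0.7, seed=0):
    """Boolean mask of training rows that keeps the class ratio in both parts."""
    rng = np.random.default_rng(seed)
    train = np.zeros(len(y), dtype=bool)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        train[idx[: int(round(train_fraction * len(idx)))]] = True
    return train

def design(X):
    """Prepend an intercept column to the predictors."""
    return np.column_stack([np.ones(len(X)), X])

def predict_proba(beta, X):
    """Predicted probability of churn = 1."""
    return 1.0 / (1.0 + np.exp(-design(X) @ beta))

def fit_logistic(X, y, iterations=25):
    """Fit by Newton's method; return coefficients and Wald p-values."""
    Xd = design(X)
    beta = np.zeros(Xd.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        hessian = Xd.T @ (Xd * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, Xd.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-Xd @ beta))
    covariance = np.linalg.inv(Xd.T @ (Xd * (p * (1.0 - p))[:, None]))
    z = beta / np.sqrt(np.diag(covariance))
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, p_values

# Synthetic stand-in for the churn data: 2 informative predictors, 1 noise column.
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 3))
true_logit = -1.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(600) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

train = stratified_split(y, train_fraction=0.7)           # 70% Train, 30% Test
full_beta, full_p = fit_logistic(X[train], y[train])      # model with all predictors
keep = np.flatnonzero(full_p[1:] < 0.05)                  # significant predictors only
beta, _ = fit_logistic(X[train][:, keep], y[train])       # reduced final model
proba = predict_proba(beta, X[~train][:, keep])           # probabilities on Test
predicted = (proba > 0.5).astype(int)                     # threshold of 0.5
print(keep, round(float(np.mean(predicted == y[~train])), 3))
```

The split is done within each class so that Train and Test keep roughly the same share of churners, which is also what sample.split is designed to do.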
After fitting the results, we need to check how many values of the Test dataset have been correctly predicted by the Logistic model created by us. This is done by creating a cross-tabulation of the actual Test values and the predicted Test values. This is called the confusion matrix.
This confusion matrix suggests that the accuracy of the model is (996 + 20)/1212 = 83.83%.
This is a good level of accuracy. However, we also need to be mindful of the other figures in the output. While the Positive Pred Value is 0.9794 (which is extremely good), the Negative Pred Value is just 0.1026 (which is very poor). This suggests that the model does a poor job on the negative class.
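The arithmetic behind these figures can be checked with a short sketch. Only the two correct counts (996 and 20) and the total (1212) are stated in the text; the two error counts used below (21 and 175) are back-calculated from the reported Pos Pred Value and Neg Pred Value, so treat them as an assumption. The text also does not state which class the software treats as positive, so the cell names follow the usual textbook definitions only.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy and predictive values from the four cells of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "pos_pred_value": tp / (tp + fp),   # share of positive predictions that are right
        "neg_pred_value": tn / (tn + fn),   # share of negative predictions that are right
    }

# 996 and 20 are from the text; 21 and 175 are inferred, not reported.
metrics = confusion_metrics(tp=996, fp=21, fn=175, tn=20)
for name, value in metrics.items():
    print(name, round(value, 4))
```

With these counts the three values round to 0.8383, 0.9794 and 0.1026, which agrees with the figures quoted above.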
Therefore, we need to alter the threshold value that we set in the last step of the methodology above. We try values such as 0.3 and 0.25 instead of 0.5 and recheck the results.
For a threshold of 0.25, the Neg Pred Value improves to 0.3538. We take this as our final result. The confusion matrix for this threshold is given below.
Accuracy of the model = 81.93%
Sensitivity = 88.00%
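The effect of moving the threshold can be illustrated with a small sweep. The labels and probabilities below are made up, `metrics_at_threshold` is a hypothetical helper, and churn = 1 is treated here as the class of interest. How these quantities map onto the Pos Pred Value and Neg Pred Value in our output depends on which class the software treats as positive, so this is a sketch of the idea rather than a reproduction of our numbers.

```python
import numpy as np

def metrics_at_threshold(y_true, proba, threshold):
    """Accuracy, recall and precision for churn = 1 at a given probability cut-off."""
    predicted = (proba > threshold).astype(int)
    hits = int(np.sum((predicted == 1) & (y_true == 1)))    # churners caught
    flagged = int(np.sum(predicted == 1))                    # predicted as churn
    churners = int(np.sum(y_true == 1))                      # actual churners
    return {
        "accuracy": float(np.mean(predicted == y_true)),
        "recall_churn": hits / churners if churners else float("nan"),
        "precision_churn": hits / flagged if flagged else float("nan"),
    }

# Hypothetical test labels and predicted churn probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.28, 0.35, 0.60, 0.15, 0.27, 0.40, 0.70])

sweep = {t: metrics_at_threshold(y_true, proba, t) for t in (0.5, 0.3, 0.25)}
for threshold, values in sweep.items():
    print(threshold, values)
```

In this toy case, lowering the cut-off from 0.5 to 0.25 catches more of the actual churners without changing the overall accuracy, which is the kind of trade-off we were looking for when we moved away from 0.5.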
I will be uploading many more of my previous projects to my portfolio soon. Do check back on this space.