The Predictive Analytics course has taught us various techniques for building models that predict a particular outcome, including supervised learning models. A supervised learning model is built from historical data in which the outcome is already known. Predicting customer churn for a telecom company is one such application.
In this project, we take up a data set containing 3333 observations of customer churn data of a telecom company. Customer churn means the customer has left the services of this particular telecom company.
The dataset contains 11 variables for each of the 3333 observations. One of these 11 variables is Churn itself: if the customer has churned, the observation has a churn value of 1; otherwise it is 0.
Since this is the variable that our model needs to predict, it is called the dependent variable. But the question is: what does it depend on? In other words, how will the system predict churn?
The answer is that churn will be predicted from the 10 other variables, which are called independent variables. The image given below will give you an idea of all the 11 variables.
On inspecting the dataset of 3333 observations and 11 variables, we observed a good mix of churn = 0 and churn = 1 observations. We therefore treated the dataset as balanced and did not apply any undersampling or oversampling.
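A class-mix check of this kind can be sketched in a few lines. The example below is a minimal illustration in Python, not the tooling used in the project; the function name `class_shares` and the label list are hypothetical and do not come from the project's data.

```python
from collections import Counter

def class_shares(labels):
    """Return the share of each class label, e.g. {0: 0.7, 1: 0.3}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical churn labels for illustration only (not the project's data).
example_churn = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
shares = class_shares(example_churn)
print(shares)
```

The closer the two shares are to each other, the less need there is for resampling.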
Our dependent variable is categorical (it takes the value 0 or 1), whereas the independent variables are either categorical or numerical. Because the dependent variable is categorical, this is a classification problem.
For such a problem, Logistic Regression appeared to be the most suitable modelling approach.
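As a brief illustration of why logistic regression suits a 0/1 outcome, the sketch below shows how the logistic function turns a weighted sum of inputs into a probability between 0 and 1. It is written in Python for illustration only; the function names, weights, and feature values are hypothetical and are not the coefficients of the project's model.

```python
import math

def logistic(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def churn_probability(features, weights, intercept):
    """Probability of churn under a logistic model (hypothetical parameters)."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(score)

# Hypothetical weights and feature values, chosen only to show the mechanics.
p = churn_probability(features=[1.0, 2.0], weights=[0.5, -0.25], intercept=0.0)
print(p)
```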
We took the following steps:
- Since the dependent variable is categorical, we converted it into a factor. A factor defines the levels of a categorical variable, so the system treats the outputs as those levels.
- Next, we split the data into a Train dataset and a Test dataset. We did this with the sample.split function and then took the Train and Test subsets from the full dataset.
- The split assigns 70% of the dataset to the Train dataset and the remaining 30% to the Test dataset.
- The model is then built on the Train dataset, which lets it learn the relationship between the dependent and independent variables.
- Next, we examined the summary of the model to check a few things, one of which is the significance of each independent variable.
- The summary shows that only 3 of the 10 independent variables are significant, meaning only these 3 have a significant impact on customer churn.
- Therefore, we set aside the first model and built a new one using only these 3 variables as independent variables: ContractRenewal, CustServCalls, and RoamMins.
- This final model is then applied to the Test dataset to check its accuracy and other performance measures.
- The output of the model is a set of probabilities. To predict whether a customer churns, we need to set a threshold. We chose a threshold of 0.5: a probability above 0.5 is predicted as churn, and anything else as no churn.
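The split and thresholding steps above can be sketched as follows. This is a minimal Python illustration, not the project's own code: `stratified_split` is a hypothetical helper that keeps each class's share roughly equal in both subsets, and the labels and probabilities are made up for the example.

```python
import random

def stratified_split(labels, train_ratio=0.7, seed=42):
    """Return a list of booleans (True = Train), splitting each class separately."""
    rng = random.Random(seed)
    in_train = [False] * len(labels)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_ratio)
        for idx in indices[:n_train]:
            in_train[idx] = True
    return in_train

def classify(probabilities, threshold=0.5):
    """Predict churn (1) when the probability is above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical labels: 80 non-churners and 20 churners.
labels = [0] * 80 + [1] * 20
mask = stratified_split(labels)
train_labels = [y for y, m in zip(labels, mask) if m]
test_labels = [y for y, m in zip(labels, mask) if not m]
print(len(train_labels), len(test_labels))

# Hypothetical predicted probabilities turned into 0/1 predictions.
print(classify([0.10, 0.50, 0.51, 0.90]))
```

Fixing the seed makes the split reproducible, which helps when comparing models built on the same Train dataset.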
After generating the predictions, we need to check how many observations in the Test dataset the logistic model predicted correctly. We do this by cross-tabulating the actual Test values against the predicted Test values. This table is called the confusion matrix.
This confusion matrix shows that the accuracy of the model is (996 + 20) / 1212 = 83.83%.
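The accuracy figure is simply the number of correct predictions divided by the total number of predictions. A quick Python check of the arithmetic, using only the counts quoted above:

```python
# Correct predictions are the two diagonal cells of the confusion matrix.
correct = 996 + 20
total = 1212

accuracy = correct / total
print(f"{accuracy:.2%}")  # 83.83%
```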
This is a good level of accuracy. However, we also need to look at the other figures in the output. While the Positive Pred Value is 0.9794 (which is extremely good), the Negative Pred Value is just 0.1026 (which is very bad). This suggests that the model's negative predictions are rarely correct.
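To make the two predictive values concrete, the sketch below cross-tabulates actual and predicted labels and derives both figures from the resulting counts. It is a Python illustration with hypothetical labels, not the project's Test data. Which class counts as "positive" depends on how the reporting tool is configured; the sketch assumes churn = 1 is the positive class, which may differ from the output quoted above.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive=1):
    """Cross-tabulate actual vs predicted labels into TP, FP, TN, FN counts."""
    pairs = Counter((a == positive, p == positive)
                    for a, p in zip(actual, predicted))
    return {
        "tp": pairs[(True, True)],    # actual positive, predicted positive
        "fp": pairs[(False, True)],   # actual negative, predicted positive
        "tn": pairs[(False, False)],  # actual negative, predicted negative
        "fn": pairs[(True, False)],   # actual positive, predicted negative
    }

def predictive_values(c):
    """Return (PPV, NPV); a value is None when that class was never predicted."""
    pos_predicted = c["tp"] + c["fp"]
    neg_predicted = c["tn"] + c["fn"]
    ppv = c["tp"] / pos_predicted if pos_predicted else None
    npv = c["tn"] / neg_predicted if neg_predicted else None
    return ppv, npv

# Hypothetical labels for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(actual, predicted)
ppv, npv = predictive_values(counts)
print(counts, ppv, npv)
```

Positive predictive value is the share of positive predictions that are actually positive, and negative predictive value is the same share for negative predictions.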
We therefore need to change the threshold value set in the final step of the methodology. We tried values such as 0.3 and 0.25 instead of 0.5 and rechecked the results.
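The effect of lowering the threshold can be seen in a small Python sketch. The probabilities below are hypothetical and only show the mechanics: as the threshold drops, more observations are predicted as churn.

```python
def classify(probabilities, threshold):
    """Predict churn (1) when the probability is above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical predicted churn probabilities (not the project's output).
probs = [0.05, 0.10, 0.20, 0.28, 0.35, 0.45, 0.60, 0.80]

flagged = {}
for threshold in (0.5, 0.3, 0.25):
    predicted = classify(probs, threshold)
    flagged[threshold] = sum(predicted)
    print(threshold, predicted)
```

Each candidate threshold produces a different set of predictions, so the confusion matrix and its derived measures need to be recomputed for each one before choosing.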
With a threshold of 0.25, the Neg Pred Value improves to 0.3538. We take this as our final result. The confusion matrix for this threshold is given below.
Accuracy of the model = 81.93%
Sensitivity = 88.00%