Sparkify: Churn Prediction with Spark

Ruben D. Castillo
7 min read · Apr 12, 2021


For subscription-based companies, customer churn is one of the most important metrics to track. It is defined as the share of customers who stop using the company’s services. Understanding why customers churn helps find the best ways to avoid it, and identifying customers with a high churn probability makes it possible to take action to retain them.

In this article we will work with the fictitious music-streaming company Sparkify, where, as in Spotify or Pandora, users can have a free account or a paid one. The main goal is to predict when a customer will cancel their subscription, so the company knows in advance and can act to prevent it.

Dataset

The original dataset is 12 GB. As this is too large for a local machine, we work with a 128 MB sample.

Each row of the dataset describes an event of a customer at a particular timestamp. There are 18 columns; the ten most important are the following:

  • userId: Id of a certain user
  • sessionId: Id of the session the user was in.
  • song: The song the user was listening to.
  • artist: The artist of that song.
  • page: Page name.
  • level: A binary value indicating whether the user had a free or a paid account at the time of the event.
  • ts: The timestamp of the particular event.
  • location: Location of the user.
  • gender: Gender of the user.
  • userAgent: User’s web browser.

Data Exploration

First of all, we have to define churn from the data we have. To do this, we take the events whose page is “Cancellation Confirmation” and create a new column, named “Churn”, that indicates whether or not the user churned.

After cleaning the data for null values, we found that there are 278,154 events, of which only 52 are churn events.

Churn instances: bar plot and counts

But what are the other instances? Let’s see:

Events per page

Almost all of the events are “Next Song”.

This does not look like a very large churn count, especially since it is less than 0.02% of the events. However, all these events are generated by only 225 customers, of whom 52 churned. That is, 23% of the customers in this reduced dataset churn, which is a very worrying figure.

There are multiple factors that could explain this high churn rate, so let’s explore some of them. First, gender could be related to churn levels. These are the genders of all of our customers:

Genders of Customers

As we can see, there are more male customers, but the dataset is fairly balanced overall. Now, let’s check the gender of the churned customers:

Gender of churn customers

There are 50% more males than females among our churned customers, a larger proportion than in the complete dataset, which suggests gender could be an important factor when we develop our machine learning model.

Another factor that could matter is the customer’s location. Let’s first check the location of all 225 customers:

Cities of all customers

Here we can see that the largest number of customers are located in two of the largest cities in the United States: New York and Los Angeles. Now let’s check where the churned customers come from:

We see that, again, New York and Los Angeles are the cities with the most churned customers. This is not very surprising because, as we saw before, these are the cities with the largest number of Sparkify customers.

There are some other possible causes. One of them is that the recommendations were bad for the customers who churned. From the data we found that churned customers gave 496 thumbs down in total, an average of 9.5 per customer while they were active. The customers who didn’t churn gave 2,050 thumbs down in total, an average of 11.84 per customer. So churned customers actually gave fewer thumbs down than those who stayed, which means bad recommendations do not necessarily explain churn.

Another possible cause is that churned customers didn’t have enough friends in the app, so they preferred an app where they could connect with more people they know. The data suggests this could be true: churned customers added 636 friends in total, or 12.23 per customer, while customers who didn’t churn added 3,641 friends, or 21.04 per customer. That is almost 9 more friend adds per customer than for churned customers, so this could be an important variable for the modeling.

Finally, the last possible reason we checked was that customers didn’t understand how the app works, which can be measured with the “Help” page. Surprisingly, churned customers pressed the help button 4.6 times per customer on average, while the other customers pressed it 7 times per customer on average.

From this exploration we found that an important factor that could explain a user’s churn is their amount of activity in the app, which can be seen in the number of sessions the customer has in our dataset.

Feature Engineering

We want to build features that help Sparkify and our models explain whether a customer will churn or not. The chosen variables were the following:

  • Hour of the event
  • Gender
  • Number of thumbs up given by the customer across the dataset
  • Number of thumbs down given by the customer across the dataset
  • Number of sessions the customer had in the dataset

Pre-Processing

There are five main phases in our pre-processing stage:

  1. Under-sampling: Our dataset is extremely unbalanced, so we decided to under-sample the majority class. We also needed this phase because model training was really slow due to our limited processing capacity.
  2. Feature merge: We merged all the features we developed before into our under-sampled dataset.
  3. Assembling: We converted our columns into a vector for the modeling phase.
  4. Train/Validation/Test split: We split our data into train (70%), validation (15%) and test (15%).
  5. Scaling: We scaled all of our data using only the train dataset, to prevent feature leakage that would bias our model’s training.

Modeling

We tried three different models:

  1. Logistic Regression
  2. Random Forest Classifier
  3. Gradient Boosting Tree

The evaluation metric was the F1 score, which combines recall and precision in a single metric.

Evaluation

Since we made our dataset balanced and smaller, the evaluation metric improved easily for all models: the F1 score was 1 on the validation dataset for every model. But we need to keep in mind that we used a really small sample of the data, as we didn’t have much processing capacity for model training.

Refinement of the model

After this, as an exercise, we decided to improve our preferred model, the random forest, by implementing a grid search. We tuned two hyperparameters: the maximum depth and the number of trees. The best result was 2 trees with a maximum depth of 5. This model had an F1 score of 1 on the test dataset, probably a sign of overfitting.

Conclusions

In this project, we tried to predict whether a Sparkify user will churn, using a subset of the original data. We had 278,154 events from 225 customers, 52 of whom churned.

We performed an exploratory data analysis of our dataset, created five different features, pre-processed our data and developed different machine learning models, all using Spark.

With this data science process, Sparkify can identify the customers with a higher churn probability and take action in advance to prevent them from leaving.

Improvement

As we said before, due to the small amount of data used, we are probably overfitting. The solution is to use more data, which means using a cloud instance, given the lack of local processing capacity, and working with the complete 12 GB dataset, or at least the full 128 MB sample. We could also create more variables in the feature engineering stage to improve the model’s performance and understand churn better.

All the code can be found here: https://github.com/rudacaya/sparkify

Thanks for reading!


Ruben D. Castillo

Data scientist and electronic engineer, passionate about machine learning, data mining and data analytics.