At Bounteous, we firmly believe in data-driven decision-making. Who are your customers? Is your customer base growing or declining? Why are customers leaving, and which ones are most likely to leave next? Organizations are investing heavily to answer these questions. Our data science team transforms first-party data in all its formats, from transactions to customer addresses to marketing spend across different platforms and channels, into predictive models that provide actionable insights. Meanwhile, the past decade has seen the rise of data science/machine learning platforms aimed at answering these same questions through an automated, code-free approach.
A 2019 article even posed the famous (or infamous?) question: “Will AutoML replace data scientists?” (see “Death of a Data Scientist”). As of August 2022, the answer is a resounding “no”; data-driven consulting and data science solutions are in greater demand than ever. So what is AutoML, and does your organization need it?
“AutoML” refers to the process of automating machine learning tasks and model development. Rather than coding a machine learning model from scratch, there are various code-free options. Cloud platforms such as Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure all offer AutoML capabilities for building, training, and deploying models.
To demonstrate how easy building a model with Google’s AutoML interface can be, Bounteous set out to solve a common business problem: building a customer churn model on Google Cloud Platform’s Vertex AI (all further references to “AutoML” are specific to GCP Vertex AI). If you want to learn more about the process, launch GCP and follow along. In this example, Bounteous uses a publicly available telecom carrier churn dataset and compares the results against various machine learning classification models built with scikit-learn in Python.
For this use case, customer churn is defined as a customer who chooses to cancel their subscription. The ability to predict which customers are most likely to churn is extremely valuable: if you can identify, early, which customers are most likely to leave, you can employ marketing strategies to keep them. Retaining customers is almost always more cost-effective than acquiring new ones, so identifying potential churners is a great way to increase profitability for any organization. This article discusses one of the many data science modeling techniques that Bounteous utilizes to improve client efficiency and profitability.
The first step in building a churn model in AutoML on GCP Vertex AI is uploading a dataset. GCP allows you to upload both structured and unstructured data: structured data such as a CSV file with rows and columns, or unstructured data such as collections of images and text.
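Although this walkthrough uses the Vertex AI console, the same upload can be scripted with the `google-cloud-aiplatform` Python SDK. Below is a minimal sketch; the project ID, bucket, and file name are hypothetical placeholders:

```python
from google.cloud import aiplatform

# Hypothetical project and region -- replace with your own.
aiplatform.init(project="my-gcp-project", location="us-central1")

# Create a Vertex AI tabular dataset from a CSV staged in Cloud Storage.
dataset = aiplatform.TabularDataset.create(
    display_name="telco-churn",
    gcs_source=["gs://my-bucket/telco_churn.csv"],
)
print(dataset.resource_name)
```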
Once your dataset is loaded on GCP, you can train your model. Since this example predicts a binary outcome (churn vs. no churn), a classification model is trained.
Under model details, the team specifies the target column as “Churn_Label”, a binary yes/no outcome in this dataset.
The next step is to specify the numeric and categorical fields the model should use for prediction (listed below, with a scripted equivalent of the training step after the table).
| Field Name | Field Type |
|---|---|
| Contract | categorical |
| Dependents | categorical |
| Internet Service | categorical |
| Monthly Charges | numerical |
| Tenure (Months) | numerical |
| Payment Method | categorical |
| Total Charges | numerical |
| Multiple Lines | categorical |
| Online Security | categorical |
| Streaming TV | categorical |
| Tech Support | categorical |
| Online Backup | categorical |
| Paperless Billing | categorical |
| Partner | categorical |
| Streaming Movies | categorical |
| Device Protection | categorical |
| Phone Service | categorical |
| Gender | categorical |
| Senior Citizen | categorical |
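For readers following along in code rather than the console, the equivalent AutoML training call in the Python SDK looks roughly like the sketch below, reusing the `dataset` object from the earlier upload sketch; the display name and training budget are illustrative:

```python
from google.cloud import aiplatform

# Configure an AutoML tabular classification job.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl",
    optimization_prediction_type="classification",
)

# Train against the target column; Vertex AI infers column types
# automatically, or they can be overridden with column_specs.
model = job.run(
    dataset=dataset,
    target_column="Churn_Label",
    budget_milli_node_hours=1000,  # ~1 node-hour training budget
)
```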
This model takes 1-1.5 hours to run.
Below are the costs associated with training a model on Vertex AI.
Once the model is complete, an email is sent to the user and the business can dive into the results.
When it comes to classification algorithms, there is a standard set of metrics used to assess model quality, all of which can be derived from something called a confusion matrix. A confusion matrix is a tool for understanding the effectiveness of a classification system: a 2×2 table of actual versus predicted results. “True positives” are outcomes predicted positive that are actually positive (think customers predicted to churn who actually churned), while “true negatives” are outcomes predicted negative that are actually negative. The remaining two quadrants hold the false positives (predicted positive, actually negative) and false negatives (predicted negative, actually positive).
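As a concrete illustration, scikit-learn can produce this 2×2 table directly; the labels in the sketch below are made-up toy values:

```python
from sklearn.metrics import confusion_matrix

# 1 = churned, 0 = stayed (toy labels for illustration only).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(actual, predicted))
```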
The confusion matrix for the GCP model is shown below.
From the confusion matrix, we can compute two important metrics, precision and recall, to gain better insight into our predictions. They are defined as:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

where TP, FP, and FN are the counts of true positives, false positives, and false negatives.
So you may be wondering which metric is more important, precision or recall? The answer depends on your purpose. If you’re trying to minimize false positives, you want a model with a high precision score. An example might be a model that selects age-appropriate children’s movies: it should surface only suitable movies and filter out everything else it deems inappropriate. On the other hand, if you’re trying to minimize false negatives, you want a model that favors a strong recall score. An example is a fraud-detection model: a credit card company may prefer a few extra calls confirming legitimate purchases over actual credit card fraud going unnoticed.
It is often convenient to combine precision and recall into one simplified metric, the F1 score, which is the harmonic mean of the two:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
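To make the relationship concrete, here is a small sketch (with the same toy labels as above) showing the harmonic-mean formula agreeing with scikit-learn’s built-in scorer:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(actual, predicted)  # TP / (TP + FP)
r = recall_score(actual, predicted)     # TP / (TP + FN)

# Harmonic mean of precision and recall.
f1_manual = 2 * p * r / (p + r)
assert abs(f1_manual - f1_score(actual, predicted)) < 1e-9
print(f"precision={p:.2f} recall={r:.2f} f1={f1_manual:.2f}")
```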
Because it is a harmonic mean, the F1 score is high only when both precision and recall are high. The following metrics are provided by the GCP Vertex AI AutoML churn model.
The ‘confidence threshold’ is the cutoff at which a predicted churn probability is converted into a churn or no-churn prediction. In the visual above, a 50% confidence threshold means the model predicts that an individual customer will churn if their predicted probability of churning exceeds 50%. Raising the confidence threshold flags only the customers who are most likely to churn, so the precision and recall calculations change accordingly.
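The same thresholding can be reproduced with any classifier that exposes predicted probabilities; the probability array below is purely illustrative:

```python
import numpy as np

# Hypothetical churn probabilities, e.g. from model.predict_proba(X)[:, 1].
churn_proba = np.array([0.15, 0.52, 0.68, 0.91, 0.45, 0.73])

# At a 50% threshold, four customers are flagged as churners...
print((churn_proba >= 0.50).astype(int))  # [0 1 1 1 0 1]

# ...raising the threshold to 70% keeps only the most confident
# predictions, which tends to raise precision and lower recall.
print((churn_proba >= 0.70).astype(int))  # [0 0 0 1 0 1]
```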
GCP AutoML also allows you to visualize which features in this model are the most important with the feature importance graph (shown below).
The two most important features in this model are ‘Contract’ and ‘Dependents’. Both are categorical fields; the options are (“Monthly”, “Annual”, “Two-Year”) for Contract and (“Yes”, “No”) for Dependents.
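Outside the AutoML console, a comparable ranking can be produced from a tree-based scikit-learn model. A minimal sketch, assuming a one-hot-encoded feature DataFrame `X` and churn labels `y` already exist:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumes X (one-hot-encoded features) and y (churn labels) already exist.
model = RandomForestClassifier(random_state=42).fit(X, y)

# Rank features by impurity-based importance, highest first.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```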
The Bounteous data science team built six different classification models using various algorithms from Python’s scikit-learn package and found their results consistent with the AutoML model’s predictions (results in the table below; a sketch of the comparison code follows it).
| Model | Precision | Recall | F1 |
|---|---|---|---|
| GCP Vertex AI AutoML | 0.82 | 0.82 | 0.82 |
| Logistic Regression | 0.90 | 0.82 | 0.86 |
| Support Vector Machine | 0.92 | 0.75 | 0.83 |
| Decision Tree | 0.72 | 0.76 | 0.74 |
| Random Forest | 0.81 | 0.83 | 0.82 |
| Naive Bayes | 0.82 | 0.83 | 0.83 |
| K-Nearest Neighbors | 0.71 | 0.80 | 0.75 |
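A sketch of how such a comparison might be scripted, assuming preprocessed train/test splits (`X_train`, `X_test`, `y_train`, `y_test`) already exist:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Assumes X_train, X_test, y_train, y_test are already prepared.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Fit each model and report the same three metrics as the table above.
for name, clf in models.items():
    preds = clf.fit(X_train, y_train).predict(X_test)
    print(f"{name}: precision={precision_score(y_test, preds):.2f} "
          f"recall={recall_score(y_test, preds):.2f} "
          f"f1={f1_score(y_test, preds):.2f}")
```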
The precision, recall, and F1 results are comparable to the Vertex AI AutoML model. The models built with scikit-learn used the same variables as Vertex AI, with no additional feature engineering or parameter tuning. Feature engineering is the process of deriving new variables from the variables already available to the model. In this example, one candidate is a variable (or set of variables) that captures the relationship between the two highest-ranking predictors, the Contract and Dependents fields. This lets you answer questions such as, “Are customers on a ‘Monthly’ contract with dependents more likely to churn than customers on an ‘Annual’ contract without dependents?” You can incorporate these types of predictors into a model via feature engineering, which is not available in GCP Vertex AI AutoML and is often required for more difficult datasets.
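A minimal sketch of this kind of interaction feature in pandas, assuming a DataFrame `df` holding the raw churn data:

```python
import pandas as pd

# Assumes df holds the raw churn data with Contract and Dependents columns.
# Combine the two strongest predictors into a single interaction feature.
df["contract_x_dependents"] = df["Contract"] + "_" + df["Dependents"]

# One-hot encode so a model can learn, e.g., that "Monthly_Yes"
# customers churn at a different rate than "Annual_No" customers.
df = pd.get_dummies(df, columns=["contract_x_dependents"])
```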
AutoML users also have the option of a more advanced approach by choosing the “Custom Training (Advanced)” method of model building. GCP AutoML makes it possible to build models with little expertise required, whereas the custom training route calls for a more experienced hand in development.
A model like this adds real value to an organization. If you know which customers are more likely to churn, you can take precautions to mitigate that risk: encourage repeat business through targeted email campaigns, or reach out via your service line to discuss customer satisfaction and how to improve it. Retaining existing customers is more efficient than finding new ones.
When it comes to building machine learning models, the details of how the model is built often matter a great deal for model quality. AutoML in GCP Vertex AI, like many other code-free machine learning tools, can be something of a black box with respect to how the model’s parameters are defined and how the dataset is sampled during training. While AutoML may be sufficient for your organization’s modeling needs, other needs require a more advanced approach.
Organizations today have many options for building machine learning capabilities. Bounteous combines GCP Vertex AI capabilities with proprietary algorithms to develop machine learning models tailored to client needs. Custom-trained models give you complete control over the model-building process.
At Bounteous, our team of data scientists and data engineers works with clients to identify and address their most pressing business problems and needs. The dataset used in the models above is ready to model out of the box, but this is often not the case in data science/engineering work. Real-world data usually comes from a variety of sources and formats, and it often takes many hours of reshaping and formatting before data can be uploaded to AutoML and modeled.
There is no magic data science platform that lets you upload 20 different CSV files, or connect to data sources at different levels of granularity (weekly, daily, and hourly time series; retail location-, region-, and zip-code-level geographic data), and get a model out. Common data wrangling tasks, such as knowing when to impute missing values and when a null actually means the value 0, still require case-by-case human judgment.
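For instance, even a simple pandas cleanup step encodes judgment calls no platform can make for you; the file and column names below are hypothetical:

```python
import pandas as pd

df = pd.read_csv("raw_customer_data.csv")  # hypothetical input file

# Judgment call 1: a null monthly spend means the customer spent
# nothing, so 0 is the correct value -- not an imputed average.
df["monthly_spend"] = df["monthly_spend"].fillna(0)

# Judgment call 2: a missing tenure is genuinely unknown, so we
# impute the median rather than assume 0 months.
df["tenure_months"] = df["tenure_months"].fillna(df["tenure_months"].median())
```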
Automated machine learning is not a complete solution for the data science life cycle. At Bounteous, these platforms are tools of the trade, employed when it is in the best interests of our clients, for example when processing marketing and e-commerce data from various platforms (Google, Adobe, Meta/Facebook, Amazon, Microsoft, etc.). A large part of Bounteous’ work involves partnering with clients to identify how best to leverage the insights that data science models provide, ultimately driving actionable business decisions.