NLU Classification and AutoML

Rahul Agarwal
6 min read · May 2, 2021


Starting from an excellent intro by Charlie Flanagan and his Machine Learning for Business class, here is some experimentation with my own models, Google AutoML, and AWS SageMaker.

Problem

Product reviews from a women’s clothing ecommerce store are provided. Each review is labeled with whether the reviewer would recommend the product or not, and some additional features come with each review. The goal is to create a model that predicts the likelihood of a product recommendation given the customer review and the additional features.

My Attempt

Using Colab and following Charlie’s example.

Step 1 — Data and Model

The raw data looks something like the table below. Note the many empty columns, so it needs some cleanup. See this notebook for the various things done to clean up the data.

Example raw data
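
As a rough idea of the kind of cleanup involved, here is a minimal pandas sketch; the input file name and the exact steps are assumptions based on the description above, not a copy of the notebook.

import pandas as pd

# Load the raw reviews file (file name is a placeholder)
data = pd.read_csv('ecommerce-reviews.csv')

# Drop unnamed index columns and columns that are entirely empty
data = data.loc[:, ~data.columns.str.startswith('Unnamed')]
data = data.dropna(axis=1, how='all')

# Rows with no review text are not usable for a text classifier
data = data.dropna(subset=['Review_Text'])

data.to_csv('ecommerce-reviews-full-set.csv', index=False)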

The resulting file, ecommerce-reviews-full-set.csv, is what is used in the rest of my code. Empty cells are a big problem, and since a lot of titles are missing, I merged the title into the review text itself so that there is only one text feature:

# Merge the title into the review text (missing titles become empty strings)
data['Title_Review'] = data['Title'].fillna('').str.cat(data['Review_Text'].astype(str), sep=' ')

The next steps are in this second notebook. The text is cleaned up by removing non-alphabetic characters, converting to lowercase, removing stopwords, and lemmatizing, and finally the count vector is created.
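
As a sketch of that pipeline with NLTK and scikit-learn (the exact regex, tokenizer, and column names may differ from the notebook):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Keep only A-Z characters, lowercase, drop stopwords, lemmatize
    text = re.sub('[^a-zA-Z]', ' ', text).lower()
    words = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return ' '.join(words)

data['Cleaned'] = data['Title_Review'].apply(clean_text)

# Bag-of-words counts used as model input; label column name is an assumption
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['Cleaned'])
y = data['Recommended']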

Step 2 — Results

The validation data is used in all cases. Classification attempt with logistic regression: AUC 0.93 and weighted F1 0.90.

Logistic Regression classification

Classification attempt with Naive Bayes: AUC 0.94 and weighted F1 0.90.

Naive Bayes classification

Classification attempt with XGBoost: AUC 0.91 and weighted F1 0.84.

XGBoost classification
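
For reference, a minimal sketch of how the three attempts can be fit and scored on a validation split (the split and default hyperparameters here are assumptions, not the notebook's exact settings):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, f1_score
from xgboost import XGBClassifier

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Naive Bayes': MultinomialNB(),
    'XGBoost': XGBClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]  # probability of "recommended"
    preds = model.predict(X_val)
    print(name,
          'AUC:', round(roc_auc_score(y_val, proba), 3),
          'weighted F1:', round(f1_score(y_val, preds, average='weighted'), 3))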

Finally, using the test data set, Naive Bayes looks to be the best.

Test set comparison of AUC
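
A comparison plot like the one above can be produced with scikit-learn's roc_curve; a sketch, assuming the fitted models dictionary from before and a held-out X_test/y_test split:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

for name, model in models.items():
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC={roc_auc_score(y_test, proba):.2f})')

plt.plot([0, 1], [0, 1], 'k--')  # chance line for reference
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()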

Maybe AUC 0.93 is too good to be true?

AWS SageMaker

Sign in to your AWS console and find the SageMaker service. The first step is to launch SageMaker Studio, which is basically Jupyter. I created a test user with the suggested role. It takes a few minutes to get set up the first time.

SageMaker Studio setup

Step 1 — Data and Setup

Go to S3 and upload your training dataset (a scripted version of the upload is sketched below). Then create a new Autopilot experiment in SageMaker Studio. Fill in the form and start it.
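
If you prefer to do the upload from code rather than the console, a minimal boto3 sketch (the bucket name and key are placeholders):

import boto3

s3 = boto3.client('s3')
# Upload the training CSV to S3 so Autopilot can read it
s3.upload_file('ecommerce-reviews-full-set.csv',
               'my-sagemaker-bucket',  # placeholder bucket name
               'autopilot/ecommerce-reviews-full-set.csv')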

Create an Autopilot experiment

It takes some time, so check back later. Once the “Pre-processing” stage is done you will see links to two notebooks. The data exploration notebook shows some helpful details about the data. The candidate generation notebook is very interesting and actually shows what Autopilot is doing. I don’t understand all of it, but it looks like XGBoost is one of the models it is trying, with some variations, along with other models I am not familiar with. It also describes the hyperparameter tuning it will do.

Auto-created notebooks in SageMaker

Step 2 — Results

Once complete, you can see the various “trials” it ran and the F1 score progressively improving.

Hyperparameter tuning list and F1 score

The F1 of 0.84 matches the similar F1 from my own XGBoost model, so that is good! You can see the model detail, and it is indeed XGBoost. Complete details are available, so with some effort I could probably print out an AUC as well. A new thing I learned is SHAP values and relative feature importance.

Feature importance in terms of SHAP values
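
SHAP values can also be computed locally with the shap library; a minimal sketch for a tree model like the XGBoost classifier trained earlier (this applies it to my count-vector model rather than Autopilot's, so the inputs here are assumptions):

import shap

# TreeExplainer is the fast path for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(models['XGBoost'])
shap_values = explainer.shap_values(X_val)

# The summary plot ranks features by mean |SHAP value|, i.e. global importance
shap.summary_plot(shap_values, X_val,
                  feature_names=vectorizer.get_feature_names_out())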

Step 3 — Cleanup

Frankly, SageMaker pricing is very confusing. There is a two-month free tier, and I think my usage falls into it, but it is very hard to see what is running and what the potential cost is. My training apparently happened on ml.m5.4xlarge, so I do owe AWS some $$ (not sure where it is set to use the free-tier version).

  • From within SageMaker Studio, make sure to stop all kernels and instances in the “Running Terminals and Kernels” tab
  • Any models you deploy show up under deployments; make sure you delete them
  • Check the SageMaker dashboard; nothing should be running (a quick programmatic check is sketched after this list)
  • I’m not clear whether leaving SageMaker Studio around is OK, so delete that too if you don’t plan to use it again
  • Clean up your S3 bucket (SageMaker actually adds a lot of files there)
  • Additionally, SageMaker exhausts your free-tier KMS and S3 call quotas, so expect some charges there as well
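
For that dashboard check, a quick boto3 sketch that lists anything still running (a sanity check, not an exhaustive audit of every billable resource):

import boto3

sm = boto3.client('sagemaker')

# Deployed endpoints keep billing until they are deleted
print('Endpoints:', sm.list_endpoints()['Endpoints'])

# Studio apps (kernels, terminals) also run on billed instances
print('Studio apps:', [a for a in sm.list_apps()['Apps'] if a['Status'] != 'Deleted'])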
Recent SageMaker activity

I think I have stopped everything, but I will check back in a few days to see if any new costs get added.

Google AutoML

Sign in to GCP and create a new project (this is paid only, so you need a billing account linked to your project). Under the hamburger menu, in the Artificial Intelligence group, pick “Natural Language.” As a one-time step it asks you to “enable API,” but subsequently you will always see the dashboard.

GCP Natural Language dashboard

For my case, the first one applies — text and document classification.

Step 1 — Data and Setup

Google is not as forgiving as pandas, and the CSV needs some cleanup before uploading; there are some pointers in the GCP docs. So I had to do some cleanup (see notebook). Google is actually very picky about what you can upload, and your CSV file can ONLY contain three columns. Their documentation is not clear, and it gives unhelpful errors, so I could not reuse the file that worked with AWS. Additionally, Google AutoML classification is based only on the text, with no other features.
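
A minimal pandas sketch of the kind of file that worked for me; as I understand the format, the columns are an optional split indicator, the text, and the label (the column choices and label strings here are assumptions):

import pandas as pd

data = pd.read_csv('ecommerce-reviews-full-set.csv')

# AutoML text classification wants at most: split, text, label
automl = pd.DataFrame({
    'ml_use': 'UNASSIGNED',  # let AutoML pick the train/validation/test split
    'text': data['Title_Review'],
    'label': data['Recommended'].map({1: 'recommended', 0: 'not_recommended'}),
})

# No header row and no index column in the uploaded CSV
automl.to_csv('automl-reviews.csv', index=False, header=False)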

Create a dataset; “single label classification” is what applies to this problem. Then provide the file and pick a bucket where to save it. Import the data next and start training.

Create Dataset
Start training

Step 2 — Results

The results are very simple, and there is no notebook or any other detail, so I have no idea what it did.

Training results

Overall, Google is very simple to use, but there is not much to learn from it.

This topic is unrelated to my everyday work, but if it interests you, reach out to me; I will appreciate any feedback. If you would like to work on other problems, you will generally find open roles as well! Please refer to LinkedIn.
