NLU classification and AutoML
Starting from an excellent intro by Charlie Flanagan and his Machine Learning for Business class, here is some experimentation with my own models, Google AutoML, and AWS SageMaker.
Problem
Product reviews from a women’s clothing ecommerce store are provided. Each review is labeled with whether the reviewer would recommend that product or not, and some additional features are provided alongside it. The goal is to create a model that can predict the likelihood of a product recommendation given the customer review and the additional features.
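To get a feel for the data, here is a quick first look in pandas. This is only a sketch, and the file name and the column names (Recommended IND, Title, Review Text) are assumptions about the raw dataset:

import pandas as pd

# Load the raw reviews file (file name and column names are assumptions)
data = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')

# Class balance of the recommendation label, useful context for the F1/AUC numbers later
print(data['Recommended IND'].value_counts(normalize=True))

# How many rows are missing a title or the review text
print(data[['Title', 'Review Text']].isna().sum())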
My Attempt
Using Colab and following Charlie’s example.
Step 1 — Data and Model
The raw data looks something like this table. Note the many empty columns, so it needs some cleanup. See this notebook for the various things done to clean up the data.
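As a rough sketch of the kind of cleanup involved, continuing from the data frame loaded above (the actual steps live in the notebook, so treat the details here as assumptions):

data = data.dropna(axis=1, how='all')        # drop columns that are entirely empty
data = data.dropna(subset=['Review Text'])   # drop rows with no review text at all

# Underscore the column names so they are easier to work with in pandas
data.columns = data.columns.str.replace(' ', '_')

data.to_csv('ecommerce-reviews-full-set.csv', index=False)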
The resulting file, ecommerce-reviews-full-set.csv, is what is used in the rest of my code. Empty cells are a big problem; a lot of titles are missing, so I merged the title into the review text itself so that there is only one text feature.
# Merge the title and the review text into a single text feature
data['Title_Review'] = data.Title.astype(str).str.cat(data['Review_Text'].astype(str), sep=' ')
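One caveat with astype(str): missing titles become the literal string 'nan'. A small variant that fills the empty cells first (an assumption about what the notebook should do, not necessarily what it does):

data['Title_Review'] = data['Title'].fillna('').str.cat(data['Review_Text'].fillna(''), sep=' ').str.strip()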
The next steps are in this second notebook. The text is cleaned up by removing non A–Z characters, converting to lowercase, removing stopwords, and lemmatizing, and finally a count vector is created.
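A minimal sketch of that pipeline with NLTK and scikit-learn; the notebook may differ in details such as the stopword list or the lemmatizer:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Keep only letters, lowercase, drop stopwords, lemmatize the rest
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(lemmatizer.lemmatize(w) for w in words if w not in stop_words)

data['Clean_Text'] = data['Title_Review'].apply(clean_text)

# Bag-of-words count vector over the cleaned text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['Clean_Text'])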
Step 2 — Results
Using the validation data in all cases:
- Logistic regression: AUC 0.93, weighted F1 0.90
- Naive Bayes: AUC 0.94, weighted F1 0.90
- XGBoost: AUC 0.91, weighted F1 0.84
Finally, on the test data set, Naive Bayes looks to be the best.
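For reference, a sketch of how the Naive Bayes numbers can be computed with scikit-learn, using the count vector X built above; the label column name is an assumption and the notebook’s exact split may differ:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, f1_score

y = data['Recommended_IND']  # label column name is an assumption

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)

# AUC uses the predicted probability of the positive class; weighted F1 uses the hard labels
val_probs = model.predict_proba(X_val)[:, 1]
val_preds = model.predict(X_val)
print('AUC:', roc_auc_score(y_val, val_probs))
print('Weighted F1:', f1_score(y_val, val_preds, average='weighted'))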
Maybe AUC 0.93 is too good to be true?
AWS SageMaker
Sign in to your AWS console and find the SageMaker service. The first step is to launch SageMaker Studio, which is basically Jupyter. I created a test user with the suggested role. It takes a few minutes to get set up the first time.
Step 1 — Data and Setup
Go to S3 and upload your training dataset. Then create a new Autopilot experiment in SageMaker Studio. Fill in the form and start it.
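If you prefer to do the upload from code rather than the console, a small boto3 sketch (the bucket name here is hypothetical):

import boto3

s3 = boto3.client('s3')
# Upload the training csv so the Autopilot experiment can read it from S3
s3.upload_file('ecommerce-reviews-full-set.csv',
               'my-sagemaker-experiments-bucket',   # hypothetical bucket name
               'autopilot/ecommerce-reviews-full-set.csv')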
It takes some time, so check back later. Once the “Pre-processing” stage is done you will see links to 2 notebooks. The data exploration notebook shows some helpful details about the data. The candidate generation notebook is very interesting and actually shows what Autopilot is doing. I don’t understand all of it, but it looks like XGBoost is one of the models it is trying, with some variations, along with other models I am not familiar with. It also describes the hyperparameter tuning it will do.
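The same experiment can also be launched programmatically instead of through the Studio form; a rough sketch with the sagemaker Python SDK, where the role, bucket, label column, and candidate count are all assumptions:

import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = sagemaker.get_execution_role()   # works when run inside SageMaker Studio

automl = AutoML(role=role,
                target_attribute_name='Recommended_IND',   # label column, an assumption
                max_candidates=20,
                sagemaker_session=session)

# Point the job at the csv uploaded to S3 earlier and let it run in the background
automl.fit(inputs='s3://my-sagemaker-experiments-bucket/autopilot/ecommerce-reviews-full-set.csv',
           wait=False, logs=False)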
Step 2 — Results
Once complete you can see various “trials” it did and the F1 score progressively improving.
The F1 of 0.84 matches the similar F1 from my XGBoost model, so that is good! You can see the model detail and it is XGBoost. Complete details are available, so with some effort I can probably print out an AUC as well. A new thing I learned is SHAP values and relative feature importance.
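As an illustration of what SHAP values look like outside SageMaker, a small sketch with the shap library on a locally trained XGBoost model, reusing the split from earlier; this is not what Autopilot ran, just the same idea:

import shap
import xgboost as xgb

# Train a simple XGBoost classifier on the same count-vector features
model = xgb.XGBClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Explain a small dense sample of the validation set
sample = X_val[:500].toarray()
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(sample)

# Summary plot = relative feature importance across the sample
shap.summary_plot(shap_values, sample, feature_names=vectorizer.get_feature_names_out())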
Step 3 — Cleanup
Frankly, SageMaker pricing is very confusing. There is a 2-month free tier, and I think my usage falls into that, but it’s very hard to see what is running and what the potential cost is. My training apparently happened on ml.m5.4xlarge, so I do owe AWS some $$ (I’m not sure where it is set to use the free-tier instance type).
- From within SageMaker Studio, make sure to stop all kernels and instances in the “Running Terminals and Kernels” tab
- Any models you deploy show up under deployments; make sure you delete them
- Look in the SageMaker dashboard and confirm nothing is running
- I’m not clear whether leaving SageMaker Studio around is ok, so delete that too if you don’t plan to use it again
- Clean up your S3 bucket (SageMaker actually adds a lot of files here)
- Additionally, SageMaker eats into your free-tier KMS and S3 call quotas, so expect some charge there as well
I think I have stopped everything, but I will check back in a few days to see if any new costs get added.
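To double-check from code that nothing is still running, a quick boto3 sketch (these calls only list things; verifying in the console is still worthwhile):

import boto3

sm = boto3.client('sagemaker')

# Anything listed here is still deployed and still billing
print(sm.list_endpoints()['Endpoints'])

# Studio apps (kernels, terminals) that are still up
print(sm.list_apps()['Apps'])

# Recent training jobs and their statuses
for job in sm.list_training_jobs(MaxResults=10)['TrainingJobSummaries']:
    print(job['TrainingJobName'], job['TrainingJobStatus'])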
Google AutoML
Sign in to GCP and create a new project (this is paid only, so you need a billing account linked to your project). Under the hamburger menu group for Artificial Intelligence, pick “Natural Language.” As a one-time step it asks to “enable API,” but subsequently you will always have the dashboard.
For my case, the first option applies — text and document classification.
Step 1 — Data and Setup
Google is not as forgiving as pandas, and the csv needs some cleanup before uploading; there are some pointers in the GCP docs. So I had to do some cleanup (see notebook). Google is actually very picky about what you can upload, and your csv file can ONLY contain 3 columns. Their documentation is not clear and it gives unhelpful errors, so I could not reuse the file that worked with AWS. Additionally, Google AutoML classification is basically based only on the text, with no other features.
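A sketch of preparing a minimal file that AutoML will accept, starting from the data frame built earlier; the exact csv rules (including an optional train/validation/test split column) are in the GCP docs, and the column names here are assumptions:

# AutoML Natural Language only uses the text, so keep just the text and the label
automl_df = data[['Title_Review', 'Recommended_IND']].dropna()

# Map the 0/1 label to readable strings (AutoML has naming rules for labels, see the GCP docs)
automl_df['Recommended_IND'] = automl_df['Recommended_IND'].map({1: 'recommended', 0: 'not_recommended'})

# Rows of text,label with no header or index
automl_df.to_csv('ecommerce-reviews-automl.csv', index=False, header=False)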
Create a dataset; “single label classification” is what applies for this problem. Then provide the file and pick a bucket to save it in. Import the data next and start training.
Step 2 — Results
The results view is very simple, and there is no notebook or any details, so I have no idea what they did.
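The trained model can at least be called for predictions; a sketch with the google-cloud-automl client, where the project ID, model ID, and sample review are placeholders:

from google.cloud import automl

project_id = 'my-gcp-project'   # placeholder
model_id = 'TCN1234567890'      # placeholder AutoML model ID

client = automl.PredictionServiceClient()
model_name = automl.AutoMlClient.model_path(project_id, 'us-central1', model_id)

# Classify a single review
payload = automl.ExamplePayload(
    text_snippet=automl.TextSnippet(content='Love this dress, fits perfectly!', mime_type='text/plain'))
response = client.predict(name=model_name, payload=payload)

for result in response.payload:
    print(result.display_name, result.classification.score)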
Overall, Google is very simple to use, but there is not much to learn from it.
This topic is unrelated to my everyday work, but if it interests you then reach out to me; I would appreciate any feedback. If you would like to work on other problems, you will generally find open roles as well! Please refer to LinkedIn.