Less than a year ago we started working on incorporating product feeds into Two Tap. Product catalogs have traditionally not been our focus, but we’re working on a number of projects that depend on having a great product database.
Two Tap has to take large CSV files from thousands of retailers, each with its own way of representing product information, and expose them through one unified (and sane) API. Building a feed injection pipeline is not the purpose of this blog post, though we might write about it at a later time.
The goal today is to describe how we take raw product information and assign products to our own taxonomy. There are a lot of companies that offer this as a service. However, we've decided to open-source everything related to the classification component. You'll get the code and the pre-trained models. And we're hoping you might want to contribute back if you find it useful.
- This is v1.0 of the model. This is not only our first ML project, it’s also our first Python initiative in a long time.
- The input data is manually labeled by our team. If you find any errors please let us know.
- This first model is really bad at categorizing electronics, because the training data is mostly apparel.
- It’s relatively easy for us to add more product types if we already support a retailer. Ping us with the stores you’d like to see in our dataset, and if they agree, we’ll include them.
- We expect to release a new version once every month or so.
How we labeled the data:
The goal is to allow publishers to build an experience like taptapcart.com, the demo B2C marketplace we created on top of our API. The first version of taptapcart relied on each retailer’s own taxonomy, and it was a huge mess to navigate. Some stores have “Clothing”, others have “Apparel”, and some have only deeply nested categories.
Two Tap’s team enables a feed inside Two Tap after applying to the store’s affiliate program and receiving access to a daily-updated CSV file containing the store’s products. Once the feed is injected and processed, Two Tap’s internal dashboard lets our team manually map store categories to the Two Tap taxonomy. The Two Tap taxonomy is a combination of the categories provided by buy.com and Google Shopping.
This is our input data.
How we built the model:
We decided to use keras, jupyter notebook, and our puny laptops for development work. We took a snapshot of our product DB and created one big training.csv file.
The model we ended up with learns based on the product title and images. We tried different approaches, like:
- binary_crossentropy, training against every level of the hierarchy (eg the label for Apparel & Accessories~~Clothing~~Activewear was [1, 1, 1, 0, 0, ...], where the index is ['Apparel & Accessories', 'Apparel & Accessories~~Clothing', 'Apparel & Accessories~~Clothing~~Activewear', ...]).
- categorical_crossentropy, which was a mistake, as it fully penalises a prediction of Apparel & Accessories~~Clothing even when the true label is Apparel & Accessories~~Clothing~~Activewear.
- With or without the product categories. With or without the first sentence from the product description.
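The hierarchical label scheme from the first bullet can be sketched roughly like this (the label index and helper name are ours for illustration, not the actual training code):

```python
# Multi-hot labels for binary_crossentropy: mark the category
# AND all of its ancestors with 1. (Illustrative label index.)
LABEL_INDEX = [
    "Apparel & Accessories",
    "Apparel & Accessories~~Clothing",
    "Apparel & Accessories~~Clothing~~Activewear",
    "Apparel & Accessories~~Shoes",
]

def multi_hot(category_path):
    """Expand a '~~'-delimited path into a multi-hot vector over LABEL_INDEX."""
    parts = category_path.split("~~")
    # Every prefix of the path is an ancestor category.
    ancestors = {"~~".join(parts[:i]) for i in range(1, len(parts) + 1)}
    return [1 if label in ancestors else 0 for label in LABEL_INDEX]

print(multi_hot("Apparel & Accessories~~Clothing~~Activewear"))
# [1, 1, 1, 0]
```

Because each output is an independent 0/1 target, a prediction that gets the parent categories right still earns partial credit, which is exactly what categorical_crossentropy denied us.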
In the end, what worked best, to our surprise, was the simplest possible approach: just title + images + one-hot encoding.
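As a rough illustration of the "title + one-hot" part (the vocabulary size and function names are our own; the real pipeline presumably used keras text utilities for this):

```python
from collections import Counter

def build_vocab(titles, max_words=20000):
    """Map the most common title words to integer indices."""
    counts = Counter(w for t in titles for w in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(max_words))}

def one_hot_title(title, vocab):
    """Turn a product title into a fixed-length multi-hot vector."""
    vec = [0] * len(vocab)
    for w in title.lower().split():
        if w in vocab:
            vec[vocab[w]] = 1
    return vec

vocab = build_vocab(["Red running shorts", "Blue running jacket"])
print(one_hot_title("red jacket", vocab))
```

These fixed-length vectors can then be concatenated with image features as input to the classifier.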
As we were working on the category model, there was a moment when nothing was working. We decided to take a step back and tackle a “simpler” problem: product image colors.
Primary image colors can be trickier to figure out than they appear: there’s skin color, background color, multiple colors, etc.
Instead of manually labelling the data we took a different approach. We looped through each product in our DB and looked at the color text names. If a name matched one of a set of common colors (“black”, “silver”, “gold”, “tan”, “purple”, ...), we saved it as a label.
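That weak-labelling pass might look something like this (the colour set and function name are illustrative, not our production code):

```python
# A small whitelist of colour words, as in the text; the real list was longer.
COMMON_COLORS = {"black", "silver", "gold", "tan", "purple", "blue", "green"}

def match_color(color_text):
    """Return the first common colour word found in a raw colour field, or None."""
    for word in color_text.lower().replace("/", " ").split():
        if word in COMMON_COLORS:
            return word
    return None

print(match_color("Jet Black / Gum"))  # black
```

Anything that doesn't match simply gets skipped, which keeps the harvested labels high-precision at the cost of recall.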
Then, for one grueling day, we manually looked through 250k product images to clean up any mistakes. The end result was a set of directories of images: “black/”, “blue/”, etc.
To train this model we again used Inception V3. However, even with this simple use case, clear training data, and a pre-created architecture like Inception, the first model was terrible.
This is where we learned about class_weights. Our data set was skewed: we had 90k black, 20k blue, and 4k green. With class_weights you can penalise a mistake on green a lot more heavily than one on black.
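A minimal sketch of computing such weights from the counts above (one common recipe: weight each class inversely to its frequency; keras's model.fit takes a class_weight dict keyed by integer class index, but we key by name here for readability):

```python
# Counts from the text: the data set is heavily skewed toward black.
counts = {"black": 90_000, "blue": 20_000, "green": 4_000}
total = sum(counts.values())

# Inverse-frequency weighting: rarer classes get proportionally larger weights.
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}
print(class_weight)
# black ≈ 0.42, blue = 1.9, green = 9.5
```

A mistake on a green image now contributes roughly 22x more to the loss than a mistake on a black one, which stops the model from just predicting the majority class.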
This was a big lesson for the category model as well, and things started working after we incorporated the change.
How we trained the model:
Once we were happy with the initial results we wanted to train the whole model. There were about 800k data points for categories, meaning 800k product images. We chose Azure, as their GPU instances are pretty great.
After creating an Azure File Storage share and mounting it via SMB on a smaller instance, we ran a script called download_images.py that went through our training data and fetched all the required images. This took a couple of days.
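The actual download_images.py isn't reproduced here, but the core of such a script might look like this (the CSV column name is an assumption):

```python
import csv
import os
import urllib.request

def download_images(csv_path, out_dir):
    """Fetch every image referenced in the training CSV into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row["image_url"]  # assumed column name
            dest = os.path.join(out_dir, os.path.basename(url))
            if os.path.exists(dest):
                continue  # resume-friendly: skip files we already have
            try:
                urllib.request.urlretrieve(url, dest)
            except OSError:
                pass  # collect failures and retry them in a later pass
```

Skipping already-downloaded files matters for a multi-day job like this: the script can be killed and restarted without losing progress.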
Once we had the images, we created another instance, an NC12 (12 cores, 112 GB memory), and mounted the SMB file share there. We used Ubuntu and this incredibly helpful script: https://github.com/leestott/Azure-GPU-Setup to enable tensorflow/keras with GPU support. I wouldn’t waste time with Azure’s pre-created data-science instances; this approach provided more flexibility and was incredibly easy to use.
The last step was to start a ‘screen’ session that runs train.py. We first trained the color model, and then the categories one.
Check out the repos
Test it out
Where do we go from here?
The next step is incorporating this model into our backend. To that end, we’ll be using a flask instance sitting on a smaller Azure server, processing jobs.
We’ll be going back to have a closer look at the labeled data.
We want to add feeds from a more diverse set of retailers (the data set is highly skewed toward apparel at this point).
Once this project matures a bit we hope to extend it for HS code labelling.
One more thing…
We’ve launched one more OSS project: a WooCommerce plugin that allows non-US retailers to sell US products just by dragging and dropping items from Two Tap’s inventory.
Check it out here. We’d love some feedback!