Similarities Using OpenAI Embeddings
Note: These views are all my own and do not reflect those of my employer.
Intro
This week for work I had to compare some of our offerings to the offerings of other companies that offer the same offerings (any more offers?). This required me to do two things: scrape the listings of other companies and then compare them to our offerings.
While I won't be sharing my actual code in this post, I will say that I used IPython (or is it Jupyter these days...) notebooks to do all the work. I found them useful because I could document my progress as I went and easily run small blocks of code on their own. Once everything is working, I plan to move it all to plain Python scripts that can easily be run on a server.
What I did
Scraping was straightforward but time-consuming. I used some tricks I learnt from reading Automate the Boring Stuff by Al Sweigart to scrape the pages. To be fair, that book is pretty old by now and I haven't looked for alternatives, so maybe there is something better these days than requests and BeautifulSoup4.
Now that I had the data the next step was to match product to product. The problem is that product names are slightly different between companies, but the good thing is each product has a fairly lengthy description. After a bit of querying the internet I discovered it was possible to do a semantic search on text using embeddings.
Now, I have only half-assed searched this, so I'm not the best person to try to learn this from, but my understanding is that an embedding is simply a way to store data as a list of numbers (a vector). In this case we are storing the semantic meaning of text in that vector. The way we get the semantic meaning is by running the text through a pre-trained model that is good at this.
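For the curious, here is a minimal sketch of what asking for embeddings can look like. It is not the code from my notebook. It assumes the v1-style `openai` Python client, and the helper name `embed_texts` and the model name `text-embedding-3-small` are placeholders I picked for illustration.

```python
def embed_texts(client, texts, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per input text, in input order.

    `client` is assumed to be an `openai.OpenAI()` instance; the model name
    is an illustrative assumption, not necessarily the one to use.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the text it belongs to,
    # so sort on it to keep vectors lined up with their descriptions.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage (needs the `openai` package and an OPENAI_API_KEY environment variable):
# from openai import OpenAI
# vectors = embed_texts(OpenAI(), ["Intro to architecture", "Organic chemistry 101"])
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before spending any money on real API calls.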
Looking through the OpenAI API pricing, I saw that their embedding model was really cheap. I estimated it would cost me about $0.25 USD to get the embeddings for every product description I had.
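A back-of-the-envelope estimate like that can be sketched as follows. Everything numeric here is an assumption for illustration: the price per 1,000 tokens is a placeholder (check the current pricing page), the "about 4 characters per token" figure is only a rule of thumb for English text, and the descriptions are made up, so this does not reproduce my actual number.

```python
def estimate_embedding_cost(texts, usd_per_1k_tokens, chars_per_token=4.0):
    """Rough cost estimate: characters -> approximate tokens -> dollars."""
    total_chars = sum(len(text) for text in texts)
    estimated_tokens = total_chars / chars_per_token
    return estimated_tokens / 1000 * usd_per_1k_tokens


# Made-up inputs: 3,000 descriptions of 2,000 characters each.
descriptions = ["x" * 2000] * 3000

# Placeholder price, not a quoted one.
cost = estimate_embedding_cost(descriptions, usd_per_1k_tokens=0.0001)
print(f"Estimated cost: ${cost:.2f}")
```

For a tighter estimate you would count real tokens with a tokenizer instead of dividing characters by four, but for a sanity check before spending money this is usually close enough.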
And I was right! I now had a few thousand embeddings for product descriptions. Now the thing about embeddings is that they are vectors, and the more similar two semantic meanings are, the smaller the angle between their vectors. For instance, if I was selling an architect course and you were selling a building design course, then the two vectors should be very close in angle. However, the vector for my architect course and the vector for your chemistry course would probably point in very different directions.
Now, the vectors I got back from the OpenAI embedding API were 1031-dimension vectors. The cool thing is they still have angles between them, even if visualizing that is not something humans can do.
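To make the angle idea concrete, here is a small sketch using toy 3-dimensional vectors. The numbers are invented for illustration and are not real embeddings, but the same formula works for vectors of any length.

```python
import math


def angle_degrees(a, b):
    """Angle between two vectors, from cos(theta) = (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))


# Made-up vectors standing in for three course descriptions.
architect = [0.9, 0.4, 0.1]
building_design = [0.8, 0.5, 0.2]
chemistry = [0.1, 0.2, 0.9]

print(f"architect vs building design: {angle_degrees(architect, building_design):.1f} degrees")
print(f"architect vs chemistry:       {angle_degrees(architect, chemistry):.1f} degrees")
```

With these toy values the first angle comes out much smaller than the second, which is the behaviour you would hope for from real embeddings of similar and dissimilar descriptions.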
I iterated over every product we offer and compared it to every product offered by our competitors using spatial.distance.cosine from the scipy package. For each of our products, the competitor product with the smallest cosine value was the best match.
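The loop looks roughly like the sketch below. This is not my notebook code: the function name `best_matches`, the dictionary layout, and the toy 3-dimensional vectors are all made up for illustration. The only real piece is `scipy.spatial.distance.cosine`.

```python
from scipy.spatial.distance import cosine


def best_matches(ours, theirs):
    """For each of our products, find the competitor product at the smallest cosine distance.

    Both arguments map a product name to its embedding vector.
    """
    matches = {}
    for our_name, our_vec in ours.items():
        # Distance from this product to every competitor product.
        distances = {name: cosine(our_vec, vec) for name, vec in theirs.items()}
        closest = min(distances, key=distances.get)
        matches[our_name] = (closest, distances[closest])
    return matches


# Toy vectors, invented for illustration (real embeddings are far longer).
ours = {"Architecture course": [0.9, 0.4, 0.1]}
theirs = {
    "Building design course": [0.8, 0.5, 0.2],
    "Chemistry course": [0.1, 0.2, 0.9],
}

for our_name, (their_name, distance) in best_matches(ours, theirs).items():
    print(f"{our_name} -> {their_name} (cosine distance {distance:.3f})")
```

Comparing everything against everything is fine for a few thousand products. With a much bigger catalogue you would probably want to stack the vectors into arrays and compute all the distances in one go.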
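Note: It seemed a bit weird to me that it was the smallest, since a lot of documentation I found (for instance, here at Google) indicates that a bigger number should correlate to a better match. The likely explanation is that that documentation is talking about cosine similarity, while scipy's spatial.distance.cosine returns the cosine distance, which is 1 minus the similarity, so smaller means closer. ¯\_(ツ)_/¯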
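The smallest-versus-biggest question can be checked directly. The sketch below computes cosine similarity by hand with numpy and compares it to what `scipy.spatial.distance.cosine` returns for the same pair of made-up vectors. The helper `cosine_similarity` is my own illustration, not a scipy function.

```python
import numpy as np
from scipy.spatial.distance import cosine


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors, for illustration only.
a = [0.9, 0.4, 0.1]
b = [0.8, 0.5, 0.2]

similarity = cosine_similarity(a, b)  # bigger = better match
distance = cosine(a, b)               # smaller = better match

print(f"similarity = {similarity:.4f}")
print(f"distance   = {distance:.4f}")
print(f"sum        = {similarity + distance:.4f}")
```

The two numbers add up to 1, so picking the smallest distance and picking the biggest similarity select the same match.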
Outcome
Looking at the results I was mostly impressed. A lot of our products matched perfectly with other products. Honestly, it kinda blew my mind that I had taken descriptions of products, got their semantic meaning as vectors, and then been able to match on that. (Part of the reason I am writing this post is I don't think my coworkers can take me gushing about it anymore and I still want to gush.)
It wasn't all sunshine and rainbows though. There were a few products that CLEARLY didn't match the product they had been paired with. My problem is that I don't know the material deeply enough to begin to diagnose where the issue is. Are the descriptions too similar? Are my assumptions wrong? Did I store the wrong vector against the wrong entry? I guess I'll need to research some more.
If you have something to say don't forget to tag me (@wyrm@wyrm.one) so I can respond!
Or email me at blog (at) lennys (dot) quest.