Keyword-based search is a well-understood and widely known technology. The most naïve linear search has three major issues. First, it is difficult to scale up to large text corpora. The system has to scan through all the documents to locate keywords which becomes very slow if the corpora grow too big. Second, it is hard to estimate the relevance of returned documents to the queried keywords, which makes ranking the results difficult. Third, a user needs to carefully craft a query string to get good results, which is often a big challenge if the user is not familiar with the topic.
Google’s PageRank algorithm tackled the first two challenges. The whole problem is beautifully formulated as a random walk problem, the core idea is to utilize the underlying graph structure derived from the links between webpages. The intuition is that similar pages are often linked together, and their relevance correlate to the graph topological structure. However, the third challenge remains. A user needs to know what to ask first before throwing keywords to Google. Quite often, we need to look for related information for an article of topics we are not familiar with. This is exactly where semantic search excels.
Vector space plays a key role in semantic search. The documents are converted to their vector representation with a well-defined mathematical function. In other words, they are mapped to different points in a high-dimensional vector space. If this mapping function is smart enough, their distance in the high-dimensional space will correlate to their semantic meanings. This further indicates that searching for relevant document for a given article becomes a problem of searching for nearest neighbors of a given point. There are different ways of constructing this smart function to convert articles to points. The most naïve one can be the Boolean model, TF-IDF, LSA, or LDA. In the recent year, the breakthrough in the deep neural network, especially the transformer-based language models, significantly boosts the performance of embedding functions. These advanced models can capture context information and semantic meaning more accurately in an article. In other words, by placing these points even better in the vector space, we can better results.
With the embedding function sorted out, there is another challenge needs to be solved before we can build real world semantic search applications. This challenge originates from search per se. For a linear search, we need to first compute pairwise distance for all the points, then rank the results to identify top k candidates. As we can see, the computational complexity grows exponentially for increasing dimensionality and number of points in the vector space. Therefore, auxiliary data structures or functions are often used to speed up the search. The popular methods include tree-based data structure, graph-based data structure, and Locality-Sensitivity Hashing. These methods usually return approximate nearest neighbors, the tradeoff between accuracy and performance can be tuned with hyper parameters. Often, tree-based and graph-based methods have much better search quality.
Back to 2015, before the breakthrough in the embedding technology, we had already started our research in optimizing the searching in high-dimensional vector space. We used a technique called random projection combined with binary tree to divide vector space into subspaces. By doing so, we achieve two things altogether. First, we are able to cluster the points; second, we are able to fit the classification into a compact tree structure. The clusters are the leaves of the binary tree. These cluster often contains errors, because some information get lost during the dimensionality reduction in random projection. However, these errors can be mitigated by using more trees. We later also did a lot of innovative work to further optimize the data structure to reduce its memory footprint, as well as advanced features such as auto-tuning and sinkhorn algorithm to re-rank the results.
Fast forward to the present day, our core MRPT algorithm has been continuously indexing incoming corpora from our customers and serving millions of requests everyday on our platform. In addition to this core technology, the most exciting development is our platform, which is built atop of a service mesh. Combined with a DSL (Domain Specific Language) designed by ourselves, the iKVA platform can construct a dynamic NLP pipeline for each incoming request. The dynamic pipeline brings two immediate benefits. First, it offers a powerful mechanism to control how the request should be answered at a much greater granularity, which allows us capture even more context information. Second, the pipeline can be programmed at backend and hot plug in on our platform. This is very valuable for our third-party developers. Overall, this feature is a strategic step for iKVA to start developing its IP on the platform, which will bring a significant advantage over competitors.
Dr Liang Wang
iKVA CSO & Founder
JOIN US FOR OUR SERIES OF FREE ONLINE WEBINARS:
Join Prof Richard Mortier for an explanation of how nearest neighbour vector search technology can empower organisations to make better, data-driven decisions.
29, Sept 2022, 15.30pm
Join iKVA, CPO, Ian Forth and iKVA CEO, Jon Horden as they explain how modern advances in technology along with recent developments in research are enabling organisations to make dramatic improvements in the way they discover and consume knowledge.
06, Oct 2022, 15.30