With the rise of large language models, semantic search has become increasingly popular. Semantic search is essentially a “nearest neighbor” (also known as k-NN) search in a pile of vectors with attached metadata. We call this pile a vector database. If you're an engineer trying to bring an LLM-based prototype into production, you might need help with choosing the right one.
In the proof-of-concept phase, it usually doesn’t matter which database you choose to store your vectors. This changes when you get to the prototype or even the production phase. The pile of vectors can grow significantly. You might need to control access to the pile, or you might not want to lose the pile when the server goes down.
There are a lot of different products out there, each with its own target audience. As no single product fits every use case, let’s focus on some aspects you might want to consider when choosing the best fit for your data management application.
Nearest neighbors, but it’s people instead of vectors. Source: Pixabay
Requirements for choosing a vector database
Every project will have different requirements. As it’s impossible to list them all, we’ll look at a few areas important not only for data management applications but for the production of AI applications in general.
Exact search or approximate search?
As with any other database, an index is required for efficient search in a large number of records. An index is a data structure that speeds up retrieval operations in a database; the principle is the same as with an index in a book. Before choosing an index, we must ask ourselves: “Do we need completely accurate results?”
If the answer is yes, then options are limited. The only practical option here is to use a "flat" index. A flat index compares the query with every stored vector, so the results are exact, but the search time grows linearly with the number of vectors.
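To make the linear cost concrete, here is a minimal sketch of an exact (flat) search in plain Python. The `flat_search` helper, the use of cosine similarity, and the toy records are assumptions made for illustration; a real database would use optimized native code, but the idea of scoring every stored vector is the same.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flat_search(records, query, k):
    """Exact k-NN: score every record against the query and keep the top k."""
    scored = ((cosine_similarity(vector, query), doc_id)
              for doc_id, vector in records.items())
    # nlargest returns the k best (score, doc_id) pairs, best first.
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Hypothetical three-dimensional embeddings keyed by document ID.
records = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
}
nearest = flat_search(records, query=[1.0, 0.2, 0.0], k=2)
print(nearest)  # ['cat', 'dog']
```

Every query touches all records, which is fine for thousands of vectors and increasingly expensive for millions.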
However, if we don’t need a perfectly accurate result, the outlook is much better. By sacrificing a bit of accuracy, we can get massive performance gains, especially when the database is large. Many indices for approximate search are available, each with different properties and tradeoffs.
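One family of approximate indices groups vectors into buckets and searches only the buckets closest to the query. The sketch below is a toy version of that idea, not the implementation of any particular database. The `BucketIndex` class, its `n_probe` parameter, and the hand-picked centroids are assumptions for illustration; real indices learn their centroids from the data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class BucketIndex:
    """Toy bucket-based index: each vector lives in the bucket of its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]

    def _closest_centroids(self, vector, count):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: squared_distance(self.centroids[i], vector))
        return order[:count]

    def add(self, doc_id, vector):
        bucket = self._closest_centroids(vector, 1)[0]
        self.buckets[bucket].append((doc_id, vector))

    def search(self, query, k, n_probe=1):
        """Search only the n_probe buckets whose centroids are closest to the query."""
        candidates = []
        for bucket in self._closest_centroids(query, n_probe):
            candidates.extend(self.buckets[bucket])
        candidates.sort(key=lambda item: squared_distance(item[1], query))
        return [doc_id for doc_id, _ in candidates[:k]]

index = BucketIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
for doc_id, vector in [("a", [1.0, 1.0]), ("b", [4.0, 4.0]),
                       ("c", [6.0, 6.0]), ("d", [9.0, 9.0])]:
    index.add(doc_id, vector)

query = [4.9, 4.9]
approximate = index.search(query, k=2, n_probe=1)  # one bucket: faster, may miss neighbors
exact = index.search(query, k=2, n_probe=2)        # all buckets: same result as a flat search
print(approximate, exact)  # ['b', 'a'] ['b', 'c']
```

With `n_probe=1` the search skips the bucket that holds "c", the second-closest vector, and returns "a" instead. That is the accuracy being traded for speed.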
Looking for a performance comparison of different indices? See ANN Benchmarks
Scalability
Do you expect your database to grow quickly? Then, it might be a good idea to think ahead and pay some attention to how your candidate database scales. One of the most common methods for scaling out is "sharding." The data is partitioned into multiple databases (shards), which spreads the load and improves performance. It may sound simple, but it gets more complicated in practice. Adding more shards to an existing deployment is not always supported, which usually leads to some form of downtime.
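A small experiment shows why adding shards later can be painful. The sketch below assumes the simplest possible routing scheme, a stable hash of the document ID modulo the number of shards. The `shard_for` helper and the document IDs are hypothetical; real databases may route data differently.

```python
import hashlib

def shard_for(doc_id, shard_count):
    """Map a document ID to a shard using a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a cryptographic digest is used to keep the routing stable.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

doc_ids = [f"doc-{i}" for i in range(1000)]
before = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}
after = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}

moved = sum(1 for doc_id in doc_ids if before[doc_id] != after[doc_id])
print(f"{moved} of {len(doc_ids)} vectors change shards when going from 4 to 5 shards")
```

With this naive scheme, roughly four out of five vectors are expected to land on a different shard after the change, so most of the data has to be moved. Schemes such as consistent hashing reduce that movement, which is one reason to check how a candidate database handles re-sharding.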
Reliability
System reliability is the ability to perform required functions under defined conditions, which are typically written down as service level objectives (SLOs). In a nutshell, it tells us whether our vector database processes requests within an acceptable period and without failures.
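As a rough illustration, an objective such as "99% of requests succeed within 100 ms" can be checked against observed traffic like this. The `meets_slo` function, the thresholds, and the sample data are made up for the example.

```python
def meets_slo(requests, max_latency_ms, target_ratio):
    """Check a latency and availability objective.

    requests is a list of (latency_ms, succeeded) pairs. A request counts
    as good only if it succeeded AND finished within max_latency_ms.
    """
    if not requests:
        return True  # no traffic means nothing violated the objective
    good = sum(1 for latency_ms, succeeded in requests
               if succeeded and latency_ms <= max_latency_ms)
    return good / len(requests) >= target_ratio

# Hypothetical sample: 97 fast successes, 2 slow successes, 1 fast failure.
observed = [(20, True)] * 97 + [(450, True)] * 2 + [(15, False)]

strict = meets_slo(observed, max_latency_ms=100, target_ratio=0.99)
relaxed = meets_slo(observed, max_latency_ms=100, target_ratio=0.95)
print(strict, relaxed)  # False True
```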
Scalability does not imply reliability, and reliability does not imply scalability. The usual method of increasing service reliability is adding more service replicas and spreading the load between them. It sounds a lot like sharding, but the key difference here is that replicas are interchangeable, while shards generally are not.
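The difference can be sketched in a few lines. In this simplified model, a query must visit every shard, because each shard holds different data, but within a shard any replica will do. The `search_cluster` function and the cluster layout are assumptions for illustration only.

```python
import heapq
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_copy(vectors, query, k):
    """Exact top-k inside one copy of a shard, as (distance, doc_id) pairs."""
    scored = ((squared_distance(vector, query), doc_id)
              for doc_id, vector in vectors.items())
    return heapq.nsmallest(k, scored)

def search_cluster(shards, query, k):
    """Query one replica of EVERY shard, then merge the partial results."""
    partial = []
    for replicas in shards:
        replica = random.choice(replicas)  # replicas are interchangeable
        partial.extend(search_copy(replica, query, k))
    return [doc_id for _, doc_id in heapq.nsmallest(k, partial)]

shard_0 = {"a": [0.0, 0.0], "b": [1.0, 1.0]}
shard_1 = {"c": [0.2, 0.1], "d": [5.0, 5.0]}
# Two shards, each with two identical replicas.
shards = [[shard_0, dict(shard_0)], [shard_1, dict(shard_1)]]

result = search_cluster(shards, query=[0.0, 0.0], k=2)
print(result)  # ['a', 'c']
```

Losing one replica of a shard leaves the results unchanged, while losing all replicas of a shard silently removes part of the data from every answer.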
A common solution is to combine sharding and replication, but it does not solve all problems. For example, if your application does a lot of writes into the database, replication gets more difficult since keeping all replicas consistent is not easy.
Some service architectures perform better in write-heavy scenarios, but the topic is complex and outside the scope of this article. The key takeaway is to choose a database whose architecture fits your expected type of load.
Durability
What happens when part of your database deployment goes down? In-memory storage can’t survive that unless it is heavily replicated, preferably across multiple regions. Designing a distributed system with such properties can be challenging, so you should consider including some form of non-volatile storage.
If your database is updated infrequently, choosing a solution with point-in-time snapshots might be worth it. On the other hand, if you expect frequent updates, you probably want something like a "write-ahead log" or a "binlog."
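The idea behind a write-ahead log fits in a few lines: every write is appended to a file on disk before it is applied in memory, and the log is replayed on startup. The `DurableVectorStore` class and the JSON-lines log format below are assumptions for this sketch and do not reflect how any specific database stores its log.

```python
import json
import os
import tempfile

class DurableVectorStore:
    """Toy in-memory store that logs every write to disk before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.vectors = {}
        self._replay()

    def _replay(self):
        """Rebuild the in-memory state from the log, e.g. after a restart."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.vectors[entry["id"]] = entry["vector"]

    def upsert(self, doc_id, vector):
        # 1. Append the change to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"id": doc_id, "vector": vector}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the in-memory state.
        self.vectors[doc_id] = vector

log_path = os.path.join(tempfile.mkdtemp(), "vectors.wal")
store = DurableVectorStore(log_path)
store.upsert("doc-1", [0.1, 0.2])
store.upsert("doc-2", [0.3, 0.4])

# Simulate a crash: forget the in-memory object and start again from the log.
recovered = DurableVectorStore(log_path)
print(recovered.vectors)  # {'doc-1': [0.1, 0.2], 'doc-2': [0.3, 0.4]}
```

A snapshot-based design would instead write the whole `vectors` dictionary to disk from time to time, which is cheaper per write but loses everything written since the last snapshot.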
Access control
A public reverse image search service will have different requirements for access control than a semantic search component of a SaaS image gallery for personal photos. You might not need a complex role-based access control system, but running a service without any kind of access control in production is universally a bad idea.
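For the personal photo gallery case, the bare minimum is that a search only ever considers vectors the caller is allowed to see. The sketch below assumes each record carries an `owner` field in its metadata; the field name and the `search_for_user` function are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_for_user(records, user_id, query, k):
    """Nearest-neighbor search restricted to records owned by user_id."""
    # Filter BEFORE ranking, so other users' records never become candidates.
    visible = [record for record in records if record["owner"] == user_id]
    visible.sort(key=lambda record: squared_distance(record["vector"], query))
    return [record["id"] for record in visible[:k]]

records = [
    {"id": "p1", "owner": "alice", "vector": [0.0, 0.0]},
    {"id": "p2", "owner": "alice", "vector": [1.0, 1.0]},
    {"id": "p3", "owner": "bob", "vector": [0.1, 0.1]},
]

alice_results = search_for_user(records, "alice", query=[0.0, 0.0], k=2)
bob_results = search_for_user(records, "bob", query=[0.0, 0.0], k=2)
print(alice_results, bob_results)  # ['p1', 'p2'] ['p3']
```

Although "p3" is closer to the query than "p2", it never shows up in Alice's results. Whether a database can apply such a filter inside the index, rather than in application code, is worth checking when you compare candidates.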
Operability and regulatory compliance
Depending on your business use case, the size of your team, and many other factors, it might be worth using a solution managed by a third party. Sure, it can be more expensive, but it can also allow you to focus on solving more critical problems in your domain instead of maintaining a database cluster.
However, it’s not all unicorns and rainbows. Do you know where the data is stored? Does the provider have all the necessary certifications? Is it available only on cloud provider A while you are using cloud provider B?
In the case of self-managed solutions, there are more questions to consider:
- If you are using Helm, does it come with a Helm chart?
- Does it publish metrics?
- How complex is the database deployment?
Running ten microservices and a full-blown message queue (think Apache Kafka) probably isn’t the right way to store 1000 vectors. However, it’s a completely legitimate setup if you have a lot of data. It doesn’t matter that a database does great in performance benchmarks when it’s painful to operate.
Another vital aspect to consider: You might already be using another database for a different use case in the same application. Check if it supports vector similarity search, either out of the box or via an extension. It's not uncommon these days (e.g., pgvector extension for Postgres, OpenSearch k-NN plugins, out-of-the-box support in Redis), and operating one component is usually easier than operating two.
Compliance checklists can be long. Source: Pixabay
Ease of use
Does the database offer a library for your preferred language or a plugin for your framework of choice? Python packages are very common, but if your project is written in Rust, for example, it might be worth considering options that have existing support instead of spending precious time on writing your own client library.
Popular database choices
Here is a quick overview of popular vector databases out there. Please bear in mind that the area is evolving rapidly, so some of this information might already be out of date.
- A very popular choice in various demos and tutorials.
- Still in the Alpha version and, thus, currently not particularly suitable for production deployment.
- Unmanaged deployment is available at the moment.
- No access control.
- Only vertical scaling, no replication out of the box.
- Mature and popular database.
- Cloud-native architecture can be deployed in multiple setups based on the expected load.
- Easy horizontal scaling.
- Managed offering available in a limited number of regions in AWS and GCP.
- Vector similarity search extension for Postgres.
- While the extension is still under heavy development, Postgres is a very mature and popular database.
- Available in AWS and Azure-managed Postgres databases in all regions.
- Distributed deployment can be tricky.
- Mature and popular database.
- No self-managed option.
- Scales by adding pods (shards) and replicas.
- Hosted deployments are available in a wide range of GCP regions and a minimal range of AWS and Azure regions.
- Virtually no access control.
- Scales via sharding.
- A very lightweight option.
- Cloud offerings are available in a limited subset of AWS and GCP regions.
- Mature and popular database.
- Employs leaderless replication (similar to DynamoDB or Cassandra).
- Re-sharding is neither supported nor on the current roadmap, which makes horizontal scaling tricky.
- SaaS offering available in GCP, hybrid SaaS
Vector database selection in a nutshell
Tl;dr? Jump right here. Source: Pixabay
Did you scroll down here right away because all you want is a recommendation? Here are some personal opinions:
- "I want something lightweight and simple that can run on my local machine." -> Use Chroma.
- "I don't care much, and I'm ready to throw money at the problem." -> Go with Pinecone.
- "I have Postgres. Can I use that?" -> Sure, try pgvector.
- "I'm a Rustacean." -> You might like Qdrant.
Picking a vector database for your application isn’t about finding “the best one,” but the one that best matches your needs.
Besides pure performance, consider other aspects that might not be immediately obvious. If you’re considering using a third-party solution, ensure it meets all legal requirements and works well with your current systems. We hope that weighing these points helps you find the right database to keep your app running smoothly.