
Vector Indexing in Vector Databases

A vector database stores data in the form of high-dimensional vectors. These vectors are embeddings of the original text, images, audio, or video files we want to store. Traditional databases store information in the form of relations, rely on SQL to manipulate data, and return exact matches for a query, whereas vector databases store any format of data as embeddings and are not limited to exact matches. In this article, we will briefly learn what vector indexing is and look at a few popular vector indexing techniques used across different vector databases. We shall also look at the vector indexing techniques supported by Open-Source vector databases such as PostgreSQL, Weaviate, QDrant, Redis, Milvus, and Vespa.

Vector databases help AI improve its capabilities with semantic information retrieval and long-term memory. Similar to indexing in traditional databases for performance optimization, vector databases support indexing capabilities that significantly increase the speed of similarity search with only a minimal trade-off in search accuracy.

What is a Vector Database?

A vector database is designed to store data in the form of vectors, i.e. numerical representations that capture the essence of the data’s characteristics. This approach helps solve a significant challenge: understanding the relationships within textual data. By converting text into embeddings that reflect semantic relationships, these databases help machine learning models process, recall, and identify relationships effectively. Vector databases have many use cases across domains and applications involving Natural Language Processing (NLP), Computer Vision (CV), Recommendation Systems (RS), and other areas that require semantic understanding and matching of data.

What are Vectors and Embeddings?

In general mathematics, a vector is an entity with magnitude and direction. However, within the context of vector databases, a vector is a list of real numbers representing different features.

A real-valued vector that encodes the meaning of a word is called an embedding; it is constructed in such a way that words that are closer in the vector space are similar in meaning. Embedding involves translating high-dimensional data into vectors within a relatively low-dimensional space.

Embedding Process Visualization

Vector Similarity Search

In traditional databases, we search for an exact match and fetch results, but in vector databases, we use similarity metrics (such as cosine similarity or Euclidean distance) that calculate the distance between vectors in a high-dimensional space and fetch the vectors that are most similar to our query vector. This is commonly referred to as Approximate Nearest Neighbor (ANN) search.

Storing and Searching of content and query in Vector Database

In detail, Approximate Nearest Neighbor search is a technique whose goal is to efficiently find points in a dataset that are close to a given query point. Instead of guaranteeing an exact match, it focuses on providing a result that is close to the true nearest neighbor, trading a little accuracy for computational efficiency. This approach is particularly useful in high-dimensional spaces where exhaustive search becomes impractical.
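To make these metrics concrete, here is a minimal NumPy sketch (the vectors are made-up toy values): cosine similarity measures the angle between two vectors, while Euclidean distance measures how far apart they are.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in exactly the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # 0.0 means the vectors are identical; larger means less similar.
    return float(np.linalg.norm(a - b))

query = np.array([0.1, 0.9, 0.4])
doc = np.array([0.2, 0.8, 0.5])
print(cosine_similarity(query, doc))
print(euclidean_distance(query, doc))
```

Real embeddings typically have hundreds or thousands of dimensions, but the arithmetic is the same.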

Vector Indexing

Vectors are not searched arbitrarily; an index is used to improve the speed of data retrieval. It is how we organize data so that we can find what we are looking for quickly. An index does not necessarily contain the entire data, but it captures the important parts needed to locate it.

There are different indexing methods used in vector databases. In this article, we will briefly look at the following vector indexing techniques:

  • Flat Indexing
  • Locality Sensitive Hashing (LSH)
  • Inverted File Index (IVF)
  • Hierarchical Navigable Small World (HNSW) Graph

Flat Indexing

In this method, vectors are stored without any modifications and are searched exhaustively, i.e., we compute the similarity of every vector against the query vector and return the K vectors with the closest similarity scores.

Although it is easy to implement, Flat Indexing is inherently slow and best suited for small datasets, where high accuracy outweighs concerns about speed.
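A flat index is essentially brute-force k-nearest-neighbour search. As a minimal sketch (the dataset here is random toy data):

```python
import numpy as np

def flat_search(vectors, query, k=3):
    # Score every stored vector against the query (exhaustive scan)
    # and return the indices of the k closest ones.
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:k]

rng = np.random.default_rng(0)
vectors = rng.random((1000, 8))   # 1000 stored 8-dimensional vectors
query = rng.random(8)
print(flat_search(vectors, query, k=3))
```

Every query costs one distance computation per stored vector, which is why this approach stops scaling beyond modest dataset sizes.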

Searching in 3D space

Locality Sensitive Hashing (LSH)

Locality Sensitive Hashing (LSH) is an indexing technique that uses hashing to optimize search speed and find approximate neighbors of the query. Regular hashing and hashing in LSH have different goals. In LSH, the goal is to map similar data points (vectors) in a high-dimensional space to the same hash bucket with high probability. This means that even if two points are not identical, as long as they are close together, they are likely to collide in the hash table.

During LSH retrieval, the query vector is hashed using the same family of hash functions, and comparisons are required only with the vectors in the identified bucket. This significantly reduces the size of the search space and minimizes the number of distance calculations needed, offering substantial performance gains.
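One classic LSH family for real-valued vectors uses random hyperplanes: each hash bit records which side of a hyperplane the vector falls on. The sketch below is illustrative only, not any particular database's implementation:

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(1)
DIM, N_PLANES = 8, 6
planes = rng.normal(size=(N_PLANES, DIM))   # one row per random hyperplane

def lsh_bucket(v):
    # Each hyperplane contributes one bit: which side of the plane v lies on.
    # Nearby vectors fall on the same side of most planes, so they are
    # likely to end up with the same bit pattern, i.e. the same bucket.
    return tuple((planes @ v) > 0)

# Index: bucket signature -> ids of the vectors hashed into it
vectors = rng.random((500, DIM))
index = defaultdict(list)
for i, v in enumerate(vectors):
    index[lsh_bucket(v)].append(i)

# At query time, only the query's own bucket needs to be scanned.
candidates = index[lsh_bucket(vectors[0])]
print(len(candidates), "candidates instead of", len(vectors))
```

In practice, several independent hash tables are usually combined to raise the chance that a true neighbor shares at least one bucket with the query.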

Inverted File Index

In Inverted File (IVF) indexing, we divide the vectors into clusters of similar vectors and compute their respective centroids. When searching, we first compare the query vector with the centroids, and then search further only within the clusters whose centroids are most similar to the query vector. A centroid is a data point representing the center (mean) of a cluster, and it is not necessarily a member of the dataset. The parameter “nprobe” indicates the number of closest centroids to consider, i.e. the number of clusters we compare our query vector against.

This indexing is like having books grouped into sections (clusters) such as fiction and non-fiction in a library. When you are looking for a specific book (your query vector), instead of checking every single book in the library (all vectors), you quickly focus on just the one or two sections most likely to have the book you want.
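The build-then-probe flow can be sketched in plain NumPy (a toy k-means stands in for a production clustering routine; the cluster count and nprobe values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.random((2000, 8))
N_CLUSTERS = 16

# --- build phase: a few rounds of plain k-means to place the centroids ---
centroids = vectors[rng.choice(len(vectors), N_CLUSTERS, replace=False)].copy()
for _ in range(10):
    # assign every vector to its nearest centroid ...
    assign = np.argmin(
        np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    # ... and move each centroid to the mean of its cluster
    for c in range(N_CLUSTERS):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# final assignment and the inverted lists: cluster id -> vector ids
assign = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1)
inverted = {c: np.where(assign == c)[0] for c in range(N_CLUSTERS)}

def ivf_search(query, k=3, nprobe=2):
    # Compare the query with the centroids only, pick the nprobe closest
    # clusters, then scan just those clusters exhaustively.
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.concatenate([inverted[c] for c in nearest])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

print(ivf_search(rng.random(8)))
```

Raising nprobe scans more clusters, which improves recall at the cost of more distance calculations.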

Hierarchical Navigable Small World Graph

HNSW (Hierarchical Navigable Small World Graph) constructs a multi-layered navigable small world graph. Each data point (vector) resides as a node within the graph, connected to a limited number of its closest neighbors based on distance metrics. This structure facilitates efficient exploration towards similar vectors for a given query.

When a search query arrives, HNSW starts by comparing it with the few nodes in the top layer and greedily moves to the closest match there. That node then serves as the entry point into the next layer down, where the search is refined among a larger set of nodes similar to the top-layer match. The process repeats layer by layer, with the upper layers acting as shortcuts, until the bottom layer, which contains all the vectors, yields the most similar data points.
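Real HNSW keeps several layers and inserts points incrementally; the sketch below flattens the idea to a single-layer navigable-small-world graph with greedy routing, which is the core primitive the layered search repeats:

```python
import numpy as np

rng = np.random.default_rng(3)
vectors = rng.random((300, 8))
M = 8  # links kept per node

# --- build: connect every node to its M nearest neighbours ---
# (real HNSW inserts points one by one and maintains several layers;
# a single fully pre-built layer keeps this sketch short)
graph = {}
for i, v in enumerate(vectors):
    order = np.argsort(np.linalg.norm(vectors - v, axis=1))
    graph[i] = list(order[1:M + 1])  # skip self at position 0

def greedy_search(query, entry=0):
    # Walk the graph, always hopping to whichever neighbour is closer
    # to the query, and stop when no neighbour improves the distance.
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        improved = False
        for nb in graph[current]:
            d = np.linalg.norm(vectors[nb] - query)
            if d < current_dist:
                current, current_dist, improved = nb, d, True
        if not improved:
            return current

print(greedy_search(rng.random(8)))
```

Greedy routing can stop at a local minimum; HNSW's extra layers and a beam of candidates (the "ef" parameter in real implementations) exist precisely to make that unlikely.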

Advantages of HNSW Indexing

Enhanced Search Efficiency: Compared to linear search methods, HNSW significantly reduces the number of distance calculations required, leading to remarkably faster retrieval of similar vectors.

Scalability: HNSW demonstrates excellent scalability with large datasets due to its efficient graph-based search approach.

Approximate Nearest Neighbors: While HNSW prioritizes speed over identifying the absolute closest neighbor, the retrieved vectors exhibit a high degree of similarity to the query, making it a valuable ANN technique.

Open Source Vector Databases

There are several Open-Source vector databases leveraged by users across the world. While there are multiple commercial vector databases, as a user, you may benefit from the following advantages with Open-Source vector databases.

Advantages of using Open-Source databases

Cost-effective: Open-source vector databases offer a significant cost advantage compared to commercially licensed databases. This makes them an attractive option for startups, research institutions, and cost-conscious organizations.

Customization: Open-source nature allows for complete customization and control over the database. Users can modify the source code to add specific features and also contribute the same back to the Community. This flexibility is beneficial for advanced users and researchers who require specialized functionalities.

Transparency: Open-source code allows for transparent operation and enables community collaboration. Users can readily examine and understand the underlying algorithms and functionalities, promoting trust and security. Additionally, a vibrant community often exists around open-source projects, offering valuable support and insights for users.

Here are some of the most popular Open-Source vector databases and information on whether they support the indexing techniques we discussed above.

Popular Open-Source vector databases include Weaviate, QDrant, Redis, Milvus, and Vespa. Each of them supports some combination of the Flat, IVF, HNSW, and LSH indexing techniques described above; refer to each database’s documentation for the exact set of supported indexes.

PostgreSQL with pgvector

PostgreSQL is an advanced Open Source database widely adopted for its relational as well as NoSQL capabilities. In addition, PostgreSQL also supports vector database features, thanks to the contributors of pgvector, an extension that adds vector database capabilities to PostgreSQL.

pgvector Indexing techniques

pgvector has emerged as one of the most popular extensions in a very short span. It initially supported the IVFFlat indexing technique, and support for HNSW indexes was added in August 2023. We will cover more details about these indexing techniques in our future articles.
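For illustration, a typical pgvector workflow looks roughly like the following (the table and column names are made up, and the operator class, e.g. vector_l2_ops, depends on the distance metric you query with):

```sql
-- enable the extension and store 3-dimensional embeddings
CREATE EXTENSION vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

-- IVFFlat index ("lists" is the number of clusters)
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- HNSW index (available since pgvector 0.5.0)
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- nearest-neighbour query ordered by L2 distance
SELECT id FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 5;
```

In practice you would create only one of the two indexes on a column, and pick the operator class matching your query operator (L2, inner product, or cosine).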

Conclusion

In conclusion, the evolution and adoption of vector databases mark a significant advancement in AI. Vector databases excel at understanding and retrieving data based on semantic similarity, thanks to their use of embeddings and advanced indexing techniques. PostgreSQL is also growing in popularity as a vector database through its pgvector extension. If you are interested in trying PostgreSQL and its vector features, contact us today and we can support you.

Seeking expertise on implementing Machine Learning use cases? Looking for PostgreSQL support or database migration assistance? Get expert advice on effective data management, query tuning, database migrations, and database optimization. Click here for personalized expertise to elevate your database performance and meet your unique business needs.

Subscribe to our Newsletters and stay tuned for more interesting topics.


 Contact Us Today!

Please enable JavaScript in your browser to complete this form.
Machine Learning Services
PostgreSQL Consulting Services
Migration Services

Author

  • Surya Sree Bathini

Surya Sree works as a Machine Learning Engineer and Backend Developer at HexaCluster. She holds a dual degree from top institutes, IIT and IIIT. She has been passionate about Machine Learning and Open-Source databases since the beginning of her academic career. She has helped multiple HexaCluster customers build AI chatbots using advanced RAG techniques and has supported them with various AI use cases.
