This is a simple example of vector similarity search using DataStax Astra DB.
This repo aims to provide an example that is simple enough to follow, yet still illustrates the key steps you'll need to take to solve real-world vector similarity use cases. It illustrates four key tasks:
- How to create a vector-enabled collection in Astra DB using astrapy;
- How to generate vector embeddings using the HuggingFace transformers library and the jinaai/jina-embeddings-v2-base-en model;
- How to insert vector embeddings and corresponding text documents in the collection;
- How to run a vector similarity search and manipulate the documents returned from the search.
While this demo does not use the OpenAI libraries to call an LLM, the patterns here are applicable to building RAG use cases.
The content that you will chunk and build embeddings for is contained in the text file towns/shadowfen.txt. This file was generated by ChatGPT and describes many aspects of a fictional town in a fantasy setting.
An advantage of this kind of autogenerated content is that it's fictional and not something that ChatGPT has been trained on: if you ask ChatGPT about the fictional town of Shadowfen, it will tell you it's not a real place and that it doesn't have any information about it.
It is easy to take the content in this repository and extend it into a RAG application if that is your goal. The output from astra_query.py is a set of questions about Shadowfen, each with the most relevant chunks of content from the text file. You can copy and paste this output directly into ChatGPT to get an idea of how well the content helps answer the questions, and then move on to an API-based implementation if you so choose.
- Create a DataStax Astra account - https://astra.datastax.com
- Create a vector database within Astra
- Get a database access token for your database using the Astra UI
- Get the API endpoint for your database (it should have a form like https://{uuid}-{region}.apps.astra.datastax.com/api/json)
See "Create an Astra DB Serverless database" for more information and documentation on Astra DB.
Clone the repo, create a virtual environment with Python 3.9+, and activate it.
In the virtual environment, install the required dependencies:
pip install -r requirements.txt
Set the following 3 environment variables:
export ASTRA_DB_API_ENDPOINT={Replace with your Astra DB API endpoint}
export ASTRA_DB_APPLICATION_TOKEN={Replace with your database token}
export ASTRA_DB_KEYSPACE={Keyspace to use, if omitted uses a default}
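As an illustration of how a script might pick these variables up, here is a minimal sketch. The helper name `load_astra_settings` and the shape of the returned dictionary are assumptions made for this example and are not taken from the repo's scripts; only the three variable names come from the instructions above.

```python
import os

# The two variables without which a script cannot connect.
REQUIRED = ("ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN")


def load_astra_settings(env=None):
    """Return connection settings read from `env` (defaults to os.environ).

    Hypothetical helper for illustration; the repo's scripts may read the
    variables differently.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "api_endpoint": env["ASTRA_DB_API_ENDPOINT"],
        "token": env["ASTRA_DB_APPLICATION_TOKEN"],
        # None stands for "keyspace omitted, fall back to the default".
        "keyspace": env.get("ASTRA_DB_KEYSPACE") or None,
    }
```

Failing early with the names of the missing variables tends to be easier to debug than a connection error later on.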
From the root directory of the repo, start by executing:
python astra_create.py
You can now see the collection in the "Data Explorer" of your Astra UI.
At this point you have a collection called town_content: it's time to chunk up the content text file, generate an embedding for each chunk using HuggingFace, and insert the chunks into the collection.
To do this, run the astra_insert.py script from the root directory of the repo:
python astra_insert.py
If this script completes successfully, it will print the number of text chunks inserted into the collection along with their embedding vectors.
You can now peek at the inserted documents with the "Data Explorer" of your Astra UI.
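To give a rough idea of what an inserted document can look like, the sketch below pairs each chunk with its vector. It is not the code from astra_insert.py: the `_id` scheme, the `text` field name, and the `toy_embed` stand-in are assumptions for illustration. `$vector` is the field the Astra DB Data API reserves for a document's embedding.

```python
def build_documents(chunks, embed):
    """Pair each text chunk with its embedding in a Data API style document.

    `embed` is any callable that maps a string to a sequence of floats.
    """
    documents = []
    for index, chunk in enumerate(chunks):
        documents.append({
            "_id": f"chunk-{index}",        # hypothetical id scheme
            "text": chunk,                  # hypothetical field name
            "$vector": list(embed(chunk)),  # reserved embedding field
        })
    return documents


def toy_embed(text):
    """Stand-in embedder for demonstration only; NOT a real model."""
    return [float(len(text)), float(text.count(" "))]
```

In a real run, `embed` would wrap the HuggingFace model, and the resulting list would be passed to the collection's insert call.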
At this point, you have several dozen items in the collection, each with its embedding vector.
The astra_query.py script has an array of several queries about Shadowfen and will retrieve the most relevant results based on a similarity search of each query.
You can modify this script to ask different questions and see which document chunks are returned.
Note that the chunking algorithm used here is fairly naive: it just works at a paragraph level. As an improvement, you may want to consider changing the chunking algorithm to use a recursive strategy or to add an overlap with adjacent sentences, and see how the results change. This is out of scope for this tutorial; just be aware that there are various chunking strategies, and that the choice affects the accuracy of the retrieval.
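For readers who want to experiment, here is a minimal sketch of paragraph-level chunking and of one possible overlap variant. It is not the repo's implementation: the function names, the blank-line paragraph rule, and the rough sentence splitter are all assumptions made for this example, and the overlap shown only borrows sentences from the preceding paragraph.

```python
import re


def split_paragraphs(text):
    """Naive chunking: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Rough sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def chunks_with_overlap(text, overlap=1):
    """Paragraph chunks, each prefixed with the last `overlap` sentences
    of the preceding paragraph (the first chunk has no prefix)."""
    paragraphs = split_paragraphs(text)
    chunks = []
    for i, paragraph in enumerate(paragraphs):
        prefix = []
        if i > 0 and overlap > 0:
            prefix = split_sentences(paragraphs[i - 1])[-overlap:]
        chunks.append(" ".join(prefix + [paragraph]))
    return chunks
```

With `overlap=0` the second function reduces to plain paragraph chunking, which makes it easy to compare retrieval results between the two settings.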
To run the similarity search, run the astra_query.py script from the root directory of the repo:
python ./astra_query.py
If this runs successfully, you'll see each query printed to the console with the two best-match documents, along with their similarity scores:
==============================
QUESTION: Who created Shadowfen?
------------------------------
Shadowfen, nestled in the crooked embrace of the Mirewood Forest, has a history [...]
[Similarity: 0.9337]
Shadowfen, with its deep connection to ancient magic and the mysterious swamp, [...]
[Similarity: 0.9190]
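To build some intuition for how results like these are ranked, here is a small local sketch of a top-two selection by cosine similarity. It rests on several assumptions: that the collection uses a cosine metric, and that documents carry their embedding under `$vector`. In the actual demo the search runs inside Astra DB, not in Python, and the score Astra reports may be scaled differently from the raw cosine value computed here.

```python
import math


def cosine_similarity(a, b):
    """Raw cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_matches(query_vector, documents, limit=2):
    """Return the `limit` best (score, document) pairs, best first."""
    scored = [
        (cosine_similarity(query_vector, doc["$vector"]), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

This brute-force scan compares the query against every document, which is fine for a few dozen chunks; a vector database exists precisely so that this step scales to much larger collections.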
Feel free to edit the queries list in astra_query.py if you want to try new questions.
You may want to check the contents of towns/shadowfen.txt to cross-check the results.