DatabricksVectorSearch
Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.
This notebook shows how to use LangChain with Databricks Vector Search.
Setup
To access Databricks models you'll need to create a Databricks account, set up credentials (only if you are outside Databricks workspace), and install required packages.
Credentials (only if you are outside Databricks)
If you are running LangChain app inside Databricks, you can skip this step.
Otherwise, you need manually set the Databricks workspace hostname and personal access token to DATABRICKS_HOST
and DATABRICKS_TOKEN
environment variables, respectively. See Authentication Documentation for how to get an access token.
import getpass
import os
os.environ["DATABRICKS_HOST"] = "https://your-databricks-workspace"
if "DATABRICKS_TOKEN" not in os.environ:
os.environ["DATABRICKS_TOKEN"] = getpass.getpass(
"Enter your Databricks access token: "
)
Installation
The LangChain Databricks integration lives in the langchain-databricks
package.
%pip install -qU langchain-databricks
Create a Vector Search Endpoint and Index (if you haven't already)
In this section, we will create a Databricks Vector Search endpoint and an index using the client SDK.
If you already have an endpoint and an index, you can skip the section and go straight to "Instantiation" section.
First, instantiate the Databricks VectorSearch client:
from databricks.vector_search.client import VectorSearchClient
client = VectorSearchClient()
Next, we will create a new VectorSearch endpoint.
endpoint_name = "<your-endpoint-name>"
client.create_endpoint(name=endpoint_name, endpoint_type="STANDARD")
Lastly, we will create an index that cna be queried on the endpoint. There are two types of indexes in Databricks Vector Search and the DatabricksVectorSearch
class support both use cases.
-
Delta Sync Index automatically syncs with a source Delta Table, automatically and incrementally updating the index as the underlying data in the Delta Table changes.
-
Direct Vector Access Index supports direct read and write of vectors and metadata. The user is responsible for updating this table using the REST API or the Python SDK.
Also for delta-sync index, you can choose to use Databricks-managed embeddings or self-managed embeddings (via LangChain embeddings classes).
The following code creates a direct-access index. Please refer to the Databricks documentation for the instruction to create the other type of indexes.
index_name = "<your-index-name>" # Format: "<catalog>.<schema>.<index-name>"
index = client.create_direct_access_index(
endpoint_name=endpoint_name,
index_name=index_name,
primary_key="id",
# Dimension of the embeddings. Please change according to the embedding model you are using.
embedding_dimension=3072,
# A column to store the embedding vectors for the text data
embedding_vector_column="text_vector",
schema={
"id": "string",
"text": "string",
"text_vector": "array<float>",
# Optional metadata columns
"source": "string",
},
)
index.describe()
Instantiation
The instantiation of DatabricksVectorSearch
is a bit different depending on whether your index uses Databricks-managed embeddings or self-managed embeddings i.e. LangChain Embeddings object of your choice.
If you are using a delta-sync index with Databricks-managed embeddings:
from langchain_databricks.vectorstores import DatabricksVectorSearch
vector_store = DatabricksVectorSearch(
endpoint=endpoint_name,
index_name=index_name,
)
If you are using a direct-access index or a delta-sync index with self-managed embeddings, you also need to provide the embedding model and text column in your source table to use for the embeddings:
- OpenAI
- Azure
- AWS
- HuggingFace
- Ollama
- Cohere
- MistralAI
- Nomic
- NVIDIA
- Fake
pip install -qU langchain-openai
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
pip install -qU langchain-openai
import getpass
os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import AzureOpenAIEmbeddings
embeddings = AzureOpenAIEmbeddings(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
pip install -qU langchain-google-vertexai
from langchain_google_vertexai import VertexAIEmbeddings
embeddings = VertexAIEmbeddings(model="text-embedding-004")
pip install -qU langchain-aws
from langchain_aws import BedrockEmbeddings
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
pip install -qU langchain-huggingface
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model="sentence-transformers/all-mpnet-base-v2")
pip install -qU langchain-ollama
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="llama3")
pip install -qU langchain-cohere
import getpass
os.environ["COHERE_API_KEY"] = getpass.getpass()
from langchain_cohere import CohereEmbeddings
embeddings = CohereEmbeddings(model="embed-english-v3.0")
pip install -qU langchain-mistralai
import getpass
os.environ["MISTRALAI_API_KEY"] = getpass.getpass()
from langchain_mistralai import MistralAIEmbeddings
embeddings = MistralAIEmbeddings(model="mistral-embed")
pip install -qU langchain-nomic
import getpass
os.environ["NOMIC_API_KEY"] = getpass.getpass()
from langchain_nomic import NomicEmbeddings
embeddings = NomicEmbeddings(model="nomic-embed-text-v1.5")
pip install -qU langchain-nvidia-ai-endpoints
import getpass
os.environ["NVIDIA_API_KEY"] = getpass.getpass()
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
embeddings = NVIDIAEmbeddings(model="NV-Embed-QA")
pip install -qU langchain-core
from langchain_core.embeddings import FakeEmbeddings
embeddings = FakeEmbeddings(size=4096)
vector_store = DatabricksVectorSearch(
endpoint=endpoint_name,
index_name=index_name,
embedding=embeddings,
# The column name in the index that contains the text data to be embedded
text_column="document_content",
)
Manage vector store
Add items to vector store
Note: Adding items to vector store via add_documents
method is only supported for a direct-access index.
from langchain_core.documents import Document
document_1 = Document(page_content="foo", metadata={"source": "https://example.com"})
document_2 = Document(page_content="bar", metadata={"source": "https://example.com"})
document_3 = Document(page_content="baz", metadata={"source": "https://example.com"})
documents = [document_1, document_2, document_3]
vector_store.add_documents(documents=documents, ids=["1", "2", "3"])
['1', '2', '3']
Delete items from vector store
Note: Deleting items to vector store via delete
method is only supported for a direct-access index.
vector_store.delete(ids=["3"])
True
Query vector store
Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent.
Query directly
Performing a simple similarity search can be done as follows:
results = vector_store.similarity_search(
query="thud", k=1, filter={"source": "https://example.com"}
)
for doc in results:
print(f"* {doc.page_content} [{doc.metadata}]")
* foo [{'id': '1'}]
Note: By default, similarity search only returns the primary key and text column. If you want to retrieve the custom metadata associated with the document, pass the additional columns in the columns
parameter when initializing the vector store.
vector_store = DatabricksVectorSearch(
endpoint=endpoint_name,
index_name=index_name,
embedding=embeddings,
text_column="text",
columns=["source"],
)
results = vector_store.similarity_search(query="thud", k=1)
for doc in results:
print(f"* {doc.page_content} [{doc.metadata}]")
* foo [{'source': 'https://example.com', 'id': '1'}]
If you want to execute a similarity search and receive the corresponding scores you can run:
results = vector_store.similarity_search_with_score(
query="thud", k=1, filter={"source": "https://example.com"}
)
for doc, score in results:
print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")
* [SIM=0.414035] foo [{'source': 'https://example.com', 'id': '1'}]
Query by turning into retriever
You can also transform the vector store into a retriever for easier usage in your chains.
retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1})
retriever.invoke("thud")
[Document(metadata={'source': 'https://example.com', 'id': '1'}, page_content='foo')]
Usage for retrieval-augmented generation
For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:
- Tutorials: working with external knowledge
- How-to: Question and answer with RAG
- Retrieval conceptual docs
API reference
For detailed documentation of all DatabricksVectorSearch features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_databricks.vectorstores.DatabricksVectorSearch.html
Related
- Vector store conceptual guide
- Vector store how-to guides