{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Databricks Vector Search\n",
"\n",
"Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.\n",
"\n",
"This notebook shows how to use LangChain with Databricks Vector Search."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install `databricks-vectorsearch` and related Python packages used in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain-core databricks-vectorsearch langchain-openai tiktoken"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `OpenAIEmbeddings` for the embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split documents and get embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"emb_dim = len(embeddings.embed_query(\"hello\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup Databricks Vector Search client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from databricks.vector_search.client import VectorSearchClient\n",
"\n",
"vsc = VectorSearchClient()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Vector Search Endpoint\n",
"This endpoint is used to create and access vector search indexes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vsc.create_endpoint(name=\"vector_search_demo_endpoint\", endpoint_type=\"STANDARD\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Direct Vector Access Index\n",
"Direct Vector Access Index supports direct read and write of embedding vectors and metadata through a REST API or an SDK. For this index, you manage embedding vectors and index updates yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vector_search_endpoint_name = \"vector_search_demo_endpoint\"\n",
"index_name = \"ml.llm.demo_index\"\n",
"\n",
"index = vsc.create_direct_access_index(\n",
" endpoint_name=vector_search_endpoint_name,\n",
" index_name=index_name,\n",
" primary_key=\"id\",\n",
" embedding_dimension=emb_dim,\n",
" embedding_vector_column=\"text_vector\",\n",
" schema={\n",
" \"id\": \"string\",\n",
" \"text\": \"string\",\n",
" \"text_vector\": \"array<float>\",\n",
" \"source\": \"string\",\n",
" },\n",
")\n",
"\n",
"index.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores import DatabricksVectorSearch\n",
"\n",
"dvs = DatabricksVectorSearch(\n",
" index, text_column=\"text\", embedding=embeddings, columns=[\"source\"]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add docs to the index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dvs.add_documents(docs)"
]
},
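{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, raw strings can be added in the same way with `add_texts`, the standard LangChain vector store method. The texts and `source` values below are made-up placeholders, and the metadata keys are assumed to match columns declared in the index schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example texts; replace them with your own data.\n",
"texts = [\n",
"    \"Vector search finds similar items.\",\n",
"    \"Delta tables store the source data.\",\n",
"]\n",
"# One metadata dict per text; \"source\" is assumed to be a column in the index schema.\n",
"metadatas = [{\"source\": \"example_a\"}, {\"source\": \"example_b\"}]\n",
"\n",
"dvs.add_texts(texts, metadatas=metadatas)"
]
},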
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Similarity search"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"dvs.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
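{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how close each match is, the following sketch uses `similarity_search_with_score`, which returns `(document, score)` pairs. The value `k=2` is an arbitrary choice for illustration, and how the score should be read depends on the similarity metric of the index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch the top 2 matches together with their scores.\n",
"results_with_scores = dvs.similarity_search_with_score(query, k=2)\n",
"for doc, score in results_with_scores:\n",
"    # Print the score, the source metadata (if returned), and a short preview.\n",
"    print(score, doc.metadata.get(\"source\"))\n",
"    print(doc.page_content[:200])"
]
},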
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Work with Delta Sync Index\n",
"\n",
"You can also use `DatabricksVectorSearch` to search in a Delta Sync Index. Delta Sync Index automatically syncs from a Delta table. You don't need to call `add_text`/`add_documents` manually. See [Databricks documentation page](https://docs.databricks.com/en/generative-ai/vector-search.html#delta-sync-index-with-managed-embeddings) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dvs_delta_sync = DatabricksVectorSearch(\"catalog_name.schema_name.delta_sync_index\")\n",
"dvs_delta_sync.similarity_search(query)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}