langchain/docs/docs/integrations/vectorstores/dingo.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "683953b3",
   "metadata": {},
   "source": [
    "# DingoDB\n",
    "\n",
    ">[DingoDB](https://dingodb.readthedocs.io/en/latest/) is a distributed multi-mode vector database, which combines the characteristics of data lakes and vector databases, and can store data of any type and size (Key-Value, PDF, audio, video, etc.). It has real-time low-latency processing capabilities to achieve rapid insight and response, and can efficiently conduct instant analysis and process multi-modal data.\n",
    "\n",
    "This notebook shows how to use functionality related to the DingoDB vector database.\n",
    "\n",
    "To run, you should have a [DingoDB instance up and running](https://github.com/dingodb/dingo-deploy/blob/main/README.md)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a62cff8a-bcf7-4e33-bbbc-76999c2e3e20",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet  dingodb\n",
    "# or install latest:\n",
    "%pip install --upgrade --quiet  git+https://git@github.com/dingodb/pydingo.git"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a0f9e02-8eb0-4aef-b11f-8861360472ee",
   "metadata": {},
   "source": [
    "We want to use OpenAIEmbeddings so we have to get the OpenAI API Key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8b6ed9cd-81b9-46e5-9c20-5aafca2844d0",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "OpenAI API Key:········\n"
     ]
    }
   ],
   "source": [
    "import getpass\n",
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "aac9563e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "from langchain_community.vectorstores import Dingo\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from langchain_text_splitters import CharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a3c3999a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "\n",
    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)\n",
    "\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "dcf88bdf",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from dingodb import DingoDB\n",
    "\n",
    "index_name = \"langchain_demo\"\n",
    "\n",
    "dingo_client = DingoDB(user=\"\", password=\"\", host=[\"127.0.0.1:13000\"])\n",
    "# First, check if our index already exists. If it doesn't, we create it\n",
    "if (\n",
    "    index_name not in dingo_client.get_index()\n",
    "    and index_name.upper() not in dingo_client.get_index()\n",
    "):\n",
    "    # we create a new index, modify to your own\n",
    "    dingo_client.create_index(\n",
    "        index_name=index_name, dimension=1536, metric_type=\"cosine\", auto_id=False\n",
    "    )\n",
    "\n",
    "# The OpenAI embedding model `text-embedding-ada-002 uses 1536 dimensions`\n",
    "docsearch = Dingo.from_documents(\n",
    "    docs, embeddings, client=dingo_client, index_name=index_name\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3aae49e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "from langchain_community.vectorstores import Dingo\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from langchain_text_splitters import CharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a8c513ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = docsearch.similarity_search(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fc516993",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(docs[0].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1eca81e4",
   "metadata": {},
   "source": [
    "### Adding More Text to an Existing Index\n",
    "\n",
    "More text can embedded and upserted to an existing Dingo index using the `add_texts` function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e40d558b",
   "metadata": {},
   "outputs": [],
   "source": [
    "vectorstore = Dingo(embeddings, \"text\", client=dingo_client, index_name=index_name)\n",
    "\n",
    "vectorstore.add_texts([\"More text!\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcb858a8",
   "metadata": {},
   "source": [
    "### Maximal Marginal Relevance Searches\n",
    "\n",
    "In addition to using similarity search in the retriever object, you can also use `mmr` as retriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "649083ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = docsearch.as_retriever(search_type=\"mmr\")\n",
    "matched_docs = retriever.invoke(query)\n",
    "for i, d in enumerate(matched_docs):\n",
    "    print(f\"\\n## Document {i}\\n\")\n",
    "    print(d.page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d3831ad",
   "metadata": {},
   "source": [
    "Or use `max_marginal_relevance_search` directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "732f58b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "found_docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10)\n",
    "for i, doc in enumerate(found_docs):\n",
    "    print(f\"{i + 1}.\", doc.page_content, \"\\n\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}