langchain/docs/docs/integrations/vectorstores/bageldb.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BagelDB\n",
    "\n",
    "> [BagelDB](https://www.bageldb.ai/) (`Open Vector Database for AI`), is like GitHub for AI data.\n",
    "It is a collaborative platform where users can create,\n",
    "share, and manage vector datasets. It can support private projects for independent developers,\n",
    "internal collaborations for enterprises, and public contributions for data DAOs.\n",
    "\n",
    "### Installation and Setup\n",
    "\n",
    "```bash\n",
    "pip install betabageldb\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create VectorStore from texts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.vectorstores import Bagel\n",
    "\n",
    "texts = [\"hello bagel\", \"hello langchain\", \"I love salad\", \"my car\", \"a dog\"]\n",
    "# create cluster and add texts\n",
    "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='hello bagel', metadata={}),\n",
       " Document(page_content='my car', metadata={}),\n",
       " Document(page_content='I love salad', metadata={})]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# similarity search\n",
    "cluster.similarity_search(\"bagel\", k=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(Document(page_content='hello bagel', metadata={}), 0.27392977476119995),\n",
       " (Document(page_content='my car', metadata={}), 1.4783176183700562),\n",
       " (Document(page_content='I love salad', metadata={}), 1.5342965126037598)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the score is a distance metric, so lower is better\n",
    "cluster.similarity_search_with_score(\"bagel\", k=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete the cluster\n",
    "cluster.delete_cluster()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create VectorStore from docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "from langchain_text_splitters import CharacterTextSplitter\n",
    "\n",
    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create cluster with docs\n",
    "cluster = Bagel.from_documents(cluster_name=\"testing_with_docs\", documents=docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the \n"
     ]
    }
   ],
   "source": [
    "# similarity search\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = cluster.similarity_search(query)\n",
    "print(docs[0].page_content[:102])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get all text/doc from Cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = [\"hello bagel\", \"this is langchain\"]\n",
    "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts)\n",
    "cluster_data = cluster.get()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['ids', 'embeddings', 'metadatas', 'documents'])"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# all keys\n",
    "cluster_data.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ids': ['578c6d24-3763-11ee-a8ab-b7b7b34f99ba',\n",
       "  '578c6d25-3763-11ee-a8ab-b7b7b34f99ba',\n",
       "  'fb2fc7d8-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  'fb2fc7d9-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  '6b40881a-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  '6b40881b-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  '581e691e-3762-11ee-a8ab-b7b7b34f99ba',\n",
       "  '581e691f-3762-11ee-a8ab-b7b7b34f99ba'],\n",
       " 'embeddings': None,\n",
       " 'metadatas': [{}, {}, {}, {}, {}, {}, {}, {}],\n",
       " 'documents': ['hello bagel',\n",
       "  'this is langchain',\n",
       "  'hello bagel',\n",
       "  'this is langchain',\n",
       "  'hello bagel',\n",
       "  'this is langchain',\n",
       "  'hello bagel',\n",
       "  'this is langchain']}"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# all values and keys\n",
    "cluster_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster.delete_cluster()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create cluster with metadata & filter using metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(Document(page_content='hello bagel', metadata={'source': 'notion'}), 0.0)]"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "texts = [\"hello bagel\", \"this is langchain\"]\n",
    "metadatas = [{\"source\": \"notion\"}, {\"source\": \"google\"}]\n",
    "\n",
    "cluster = Bagel.from_texts(cluster_name=\"testing\", texts=texts, metadatas=metadatas)\n",
    "cluster.similarity_search_with_score(\"hello bagel\", where={\"source\": \"notion\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete the cluster\n",
    "cluster.delete_cluster()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}