You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/docs/docs/integrations/vectorstores/chroma.ipynb

573 lines
21 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"source": [
"# Chroma\n",
"\n",
">[Chroma](https://docs.trychroma.com/getting-started) is a AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.\n",
"\n",
"\n",
"Install Chroma with:\n",
"\n",
"```sh\n",
"pip install langchain-chroma\n",
"```\n",
"\n",
"Chroma runs in various modes. See below for examples of each integrated with LangChain.\n",
"- `in-memory` - in a python script or jupyter notebook\n",
"- `in-memory with persistance` - in a script or notebook and save/load to disk\n",
"- `in a docker container` - as a server running your local machine or in the cloud\n",
"\n",
"Like any other database, you can: \n",
"- `.add` \n",
"- `.get` \n",
"- `.update`\n",
"- `.upsert`\n",
"- `.delete`\n",
"- `.peek`\n",
"- and `.query` runs the similarity search.\n",
"\n",
"View full docs at [docs](https://docs.trychroma.com/reference/Collection). To access these methods directly, you can do `._collection.method()`\n"
]
},
{
"cell_type": "markdown",
"id": "2b5ffbf8",
"metadata": {},
"source": [
"## Basic Example\n",
"\n",
"In this basic example, we take the most recent State of the Union Address, split it into chunks, embed it using an open-source embedding model, load it into Chroma, and then query it."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ae9fcf3e",
"metadata": {},
"outputs": [
{
Fix update_document function, add test and documentation. (#5359) # Fix for `update_document` Function in Chroma ## Summary This pull request addresses an issue with the `update_document` function in the Chroma class, as described in [#5031](https://github.com/hwchase17/langchain/issues/5031#issuecomment-1562577947). The issue was identified as an `AttributeError` raised when calling `update_document` due to a missing corresponding method in the `Collection` object. This fix refactors the `update_document` method in `Chroma` to correctly interact with the `Collection` object. ## Changes 1. Fixed the `update_document` method in the `Chroma` class to correctly call methods on the `Collection` object. 2. Added the corresponding test `test_chroma_update_document` in `tests/integration_tests/vectorstores/test_chroma.py` to reflect the updated method call. 3. Added an example and explanation of how to use the `update_document` function in the Jupyter notebook tutorial for Chroma. ## Test Plan All existing tests pass after this change. In addition, the `test_chroma_update_document` test case now correctly checks the functionality of `update_document`, ensuring that the function works as expected and updates the content of documents correctly. ## Reviewers @dev2049 This fix will ensure that users are able to use the `update_document` function as expected, without encountering the previous `AttributeError`. This will enhance the usability and reliability of the Chroma class for all users. Thank you for considering this pull request. I look forward to your feedback and suggestions.
1 year ago
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"# import\n",
"from langchain_chroma import Chroma\n",
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.embeddings.sentence_transformer import (\n",
" SentenceTransformerEmbeddings,\n",
")\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"# load the document and split it into chunks\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"\n",
"# split it into chunks\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"# create the open-source embedding function\n",
"embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
"\n",
"# load it into Chroma\n",
"db = Chroma.from_documents(docs, embedding_function)\n",
"\n",
"# query it\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)\n",
"\n",
"# print results\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "5c9a11cc",
"metadata": {},
"source": [
"## Basic Example (including saving to disk)\n",
"\n",
"Extending the previous example, if you want to save to disk, simply initialize the Chroma client and pass the directory where you want the data to be saved to. \n",
"\n",
"`Caution`: Chroma makes a best-effort to automatically save data to disk, however multiple in-memory clients can stop each other's work. As a best practice, only have one client per path running at any given time."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "49f9bd49",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"# save to disk\n",
"db2 = Chroma.from_documents(docs, embedding_function, persist_directory=\"./chroma_db\")\n",
"docs = db2.similarity_search(query)\n",
"\n",
"# load from disk\n",
"db3 = Chroma(persist_directory=\"./chroma_db\", embedding_function=embedding_function)\n",
"docs = db3.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "63318cc9",
"metadata": {},
"source": [
"## Passing a Chroma Client into Langchain\n",
"\n",
"You can also create a Chroma Client and pass it to LangChain. This is particularly useful if you want easier access to the underlying database.\n",
"\n",
"You can also specify the collection name that you want LangChain to use."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "22f4a0ce",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Add of existing embedding ID: 1\n",
"Add of existing embedding ID: 2\n",
"Add of existing embedding ID: 3\n",
"Add of existing embedding ID: 1\n",
"Add of existing embedding ID: 2\n",
"Add of existing embedding ID: 3\n",
"Add of existing embedding ID: 1\n",
"Insert of existing embedding ID: 1\n",
"Add of existing embedding ID: 2\n",
"Insert of existing embedding ID: 2\n",
"Add of existing embedding ID: 3\n",
"Insert of existing embedding ID: 3\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 3 in the collection\n"
]
}
],
"source": [
"import chromadb\n",
"\n",
"persistent_client = chromadb.PersistentClient()\n",
"collection = persistent_client.get_or_create_collection(\"collection_name\")\n",
"collection.add(ids=[\"1\", \"2\", \"3\"], documents=[\"a\", \"b\", \"c\"])\n",
"\n",
"langchain_chroma = Chroma(\n",
" client=persistent_client,\n",
" collection_name=\"collection_name\",\n",
" embedding_function=embedding_function,\n",
")\n",
"\n",
"print(\"There are\", langchain_chroma._collection.count(), \"in the collection\")"
]
},
{
"cell_type": "markdown",
"id": "e9cf6d70",
"metadata": {},
"source": [
"## Basic Example (using the Docker Container)\n",
"\n",
"You can also run the Chroma Server in a Docker container separately, create a Client to connect to it, and then pass that to LangChain. \n",
"\n",
"Chroma has the ability to handle multiple `Collections` of documents, but the LangChain interface expects one, so we need to specify the collection name. The default collection name used by LangChain is \"langchain\".\n",
"\n",
"Here is how to clone, build, and run the Docker Image:\n",
"```sh\n",
"git clone git@github.com:chroma-core/chroma.git\n",
"```\n",
"\n",
"Edit the `docker-compose.yml` file and add `ALLOW_RESET=TRUE` under `environment`\n",
"```yaml\n",
" ...\n",
" command: uvicorn chromadb.app:app --reload --workers 1 --host 0.0.0.0 --port 8000 --log-config log_config.yml\n",
" environment:\n",
" - IS_PERSISTENT=TRUE\n",
" - ALLOW_RESET=TRUE\n",
" ports:\n",
" - 8000:8000\n",
" ...\n",
"```\n",
"\n",
"Then run `docker-compose up -d --build`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "74aee70e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"# create the chroma client\n",
"import uuid\n",
"\n",
"import chromadb\n",
"from chromadb.config import Settings\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
"\n",
"client = chromadb.HttpClient(settings=Settings(allow_reset=True))\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
"client.reset() # resets the database\n",
"collection = client.create_collection(\"my_collection\")\n",
"for doc in docs:\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
" collection.add(\n",
" ids=[str(uuid.uuid1())], metadatas=doc.metadata, documents=doc.page_content\n",
" )\n",
"\n",
"# tell LangChain to use our client and collection name\n",
"db4 = Chroma(\n",
" client=client,\n",
" collection_name=\"my_collection\",\n",
" embedding_function=embedding_function,\n",
")\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db4.similarity_search(query)\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "9ed3ec50",
"metadata": {},
"source": [
"## Update and Delete\n",
"\n",
"While building toward a real application, you want to go beyond adding data, and also update and delete data. \n",
"\n",
"Chroma has users provide `ids` to simplify the bookkeeping here. `ids` can be the name of the file, or a combined has like `filename_paragraphNumber`, etc.\n",
"\n",
"Chroma supports all these operations - though some of them are still being integrated all the way through the LangChain interface. Additional workflow improvements will be added soon.\n",
"\n",
"Here is a basic example showing how to do various operations:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "81a02810",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': '../../../state_of_the_union.txt'}\n",
"{'ids': ['1'], 'embeddings': None, 'metadatas': [{'new_value': 'hello world', 'source': '../../../state_of_the_union.txt'}], 'documents': ['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.']}\n",
"count before 46\n",
"count after 45\n"
]
}
],
"source": [
"# create simple ids\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
"ids = [str(i) for i in range(1, len(docs) + 1)]\n",
"\n",
"# add data\n",
"example_db = Chroma.from_documents(docs, embedding_function, ids=ids)\n",
"docs = example_db.similarity_search(query)\n",
"print(docs[0].metadata)\n",
"\n",
"# update the metadata for a document\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
"docs[0].metadata = {\n",
" \"source\": \"../../modules/state_of_the_union.txt\",\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
" \"new_value\": \"hello world\",\n",
"}\n",
"example_db.update_document(ids[0], docs[0])\n",
"print(example_db._collection.get(ids=[ids[0]]))\n",
"\n",
"# delete the last document\n",
"print(\"count before\", example_db._collection.count())\n",
"example_db._collection.delete(ids=[ids[-1]])\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
"print(\"count after\", example_db._collection.count())"
]
},
{
"cell_type": "markdown",
"id": "ac6bc71a",
"metadata": {},
"source": [
"## Use OpenAI Embeddings\n",
"\n",
"Many people like to use OpenAIEmbeddings, here is how to set that up."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "42080f37-8fd1-4cec-acd9-15d2b03b2f4d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# get a token: https://platform.openai.com/account/api-keys\n",
"\n",
"from getpass import getpass\n",
"\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"OPENAI_API_KEY = getpass()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c7a94d6c-b4d4-4498-9bdd-eb50c92b85c5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
11 months ago
"\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5eabdb75",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"embeddings = OpenAIEmbeddings()\n",
"new_client = chromadb.EphemeralClient()\n",
"openai_lc_client = Chroma.from_documents(\n",
" docs, embeddings, client=new_client, collection_name=\"openai_collection\"\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = openai_lc_client.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "6d9c28ad",
"metadata": {},
"source": [
"***\n",
"\n",
"## Other Information"
]
},
{
"cell_type": "markdown",
"id": "18152965",
"metadata": {},
"source": [
"### Similarity search with score"
]
},
{
"cell_type": "markdown",
"id": "346347d7",
"metadata": {},
"source": [
"The returned distance score is cosine distance. Therefore, a lower score is better."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "72aaa9c8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"docs = db.similarity_search_with_score(query)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d88e958e",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}),\n",
" 1.1972057819366455)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "794a7552",
"metadata": {},
"source": [
"### Retriever options\n",
"\n",
"This section goes over different options for how to use Chroma as a retriever.\n",
"\n",
"#### MMR\n",
"\n",
"In addition to using similarity search in the retriever object, you can also use `mmr`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "96ff911a",
"metadata": {},
"outputs": [],
"source": [
"retriever = db.as_retriever(search_type=\"mmr\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "f00be6d0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.invoke(query)[0]"
]
},
{
"cell_type": "markdown",
"id": "275dbd0a",
"metadata": {},
"source": [
"### Filtering on metadata\n",
"\n",
"It can be helpful to narrow down the collection before working with it.\n",
"\n",
"For example, collections can be filtered on metadata using the get method."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "81600dc1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': []}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# filter collection for updated source\n",
"example_db.get(where={\"source\": \"some_other_source\"})"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}