{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"Open\n", "\n", "# Maritalk\n", "\n", "## Introduction\n", "\n", "MariTalk is an assistant developed by the Brazilian company [Maritaca AI](https://www.maritaca.ai).\n", "MariTalk is based on language models that have been specially trained to understand Portuguese well.\n", "\n", "This notebook demonstrates how to use MariTalk with LangChain through two examples:\n", "\n", "1. A simple example of how to use MariTalk to perform a task.\n", "2. LLM + RAG: The second example shows how to answer a question whose answer is found in a long document that does not fit within the token limit of MariTalk. For this, we will use a simple searcher (BM25) to first search the document for the most relevant sections and then feed them to MariTalk for answering." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "First, install the LangChain library (and all its dependencies) using the following command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install langchain langchain-core langchain-community" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API Key\n", "You will need an API key that can be obtained from chat.maritaca.ai (\"Chaves da API\" section)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Example 1 - Pet Name Suggestions\n", "\n", "Let's define our language model, ChatMaritalk, and configure it with your API key." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain_community.chat_models import ChatMaritalk\n", "from langchain_core.output_parsers import StrOutputParser\n", "from langchain_core.prompts.chat import ChatPromptTemplate\n", "\n", "llm = ChatMaritalk(\n", " model=\"sabia-2-medium\", # Available models: sabia-2-small and sabia-2-medium\n", " api_key=\"\", # Insert your API key here\n", " temperature=0.7,\n", " max_tokens=100,\n", ")\n", "\n", "output_parser = StrOutputParser()\n", "\n", "chat_prompt = ChatPromptTemplate.from_messages(\n", " [\n", " (\n", " \"system\",\n", " \"You are an assistant specialized in suggesting pet names. Given the animal, you must suggest 4 names.\",\n", " ),\n", " (\"human\", \"I have a {animal}\"),\n", " ]\n", ")\n", "\n", "chain = chat_prompt | llm | output_parser\n", "\n", "response = chain.invoke({\"animal\": \"dog\"})\n", "print(response) # should answer something like \"1. Max\\n2. Bella\\n3. Charlie\\n4. Rocky\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 2 - RAG + LLM: UNICAMP 2024 Entrance Exam Question Answering System\n", "For this example, we need to install some extra libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install unstructured rank_bm25 pdf2image pdfminer-six pikepdf pypdf unstructured_inference fastapi kaleido uvicorn \"pillow<10.1.0\" pillow_heif -q" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loading the database\n", "\n", "The first step is to create a database with the information from the notice. For this, we will download the notice from the COMVEST website and segment the extracted text into 500-character windows." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import OnlinePDFLoader\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "\n", "# Loading the COMVEST 2024 notice\n", "loader = OnlinePDFLoader(\n", " \"https://www.comvest.unicamp.br/wp-content/uploads/2023/10/31-2023-Dispoe-sobre-o-Vestibular-Unicamp-2024_com-retificacao.pdf\"\n", ")\n", "data = loader.load()\n", "\n", "text_splitter = RecursiveCharacterTextSplitter(\n", " chunk_size=500, chunk_overlap=100, separators=[\"\\n\", \" \", \"\"]\n", ")\n", "texts = text_splitter.split_documents(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating a Searcher\n", "Now that we have our database, we need a searcher. For this example, we will use a simple BM25 as a search system, but this could be replaced by any other searcher (such as search via embeddings)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain.retrievers import BM25Retriever\n", "\n", "retriever = BM25Retriever.from_documents(texts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Combining Search System + LLM\n", "Now that we have our searcher, we just need to implement a prompt specifying the task and invoke the chain." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain.chains.question_answering import load_qa_chain\n", "\n", "prompt = \"\"\"Baseado nos seguintes documentos, responda a pergunta abaixo.\n", "\n", "{context}\n", "\n", "Pergunta: {query}\n", "\"\"\"\n", "\n", "qa_prompt = ChatPromptTemplate.from_messages([(\"human\", prompt)])\n", "\n", "chain = load_qa_chain(llm, chain_type=\"stuff\", verbose=True, prompt=qa_prompt)\n", "\n", "query = \"Qual o tempo máximo para realização da prova?\"\n", "\n", "docs = retriever.invoke(query)\n", "\n", "chain.invoke(\n", " {\"input_documents\": docs, \"query\": query}\n", ") # Should output something like: \"O tempo máximo para realização da prova é de 5 horas.\"" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }