docs: add how-to on multi-modal tool calling (#21667)

Can move this to a dedicated multi-modal section if desired.
3 weeks ago · 12b599c47f
parent 5c64c004cc
commit 12b599c47f
2 changed files with 161 additions and 0 deletions
--- a/docs/docs/how_to/index.mdx
+++ b/docs/docs/how_to/index.mdx
@@ -172,6 +172,7 @@ LangChain Tools contain a description of the tool (to pass to the language model
 - [How to: add a human in the loop to tool usage](/docs/how_to/tools_human)
 - [How to: do parallel tool use](/docs/how_to/tools_parallel)
 - [How to: handle errors when calling tools](/docs/how_to/tools_error)
+- [How to: call tools using multi-modal data](/docs/how_to/tool_calls_multi_modal)

 ### Agents

--- a/docs/docs/how_to/tool_calls_multi_modal.ipynb
+++ b/docs/docs/how_to/tool_calls_multi_modal.ipynb
@@ -0,0 +1,160 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4facdf7f-680e-4d28-908b-2b8408e2a741",
+   "metadata": {},
+   "source": [
+    "# How to call tools with multi-modal data\n",
+    "\n",
+    "Here we demonstrate how to call tools with multi-modal data, such as images.\n",
+    "\n",
+    "Some multi-modal models, such as those that can reason over images or audio, support [tool calling](/docs/concepts/#functiontool-calling) features as well.\n",
+    "\n",
+    "To call tools using such models, simply bind tools to them in the [usual way](/docs/how_to/tool_calling), and invoke the model using content blocks of the desired type (e.g., containing image data).\n",
+    "\n",
+    "Below, we demonstrate examples using [OpenAI](/docs/integrations/platforms/openai) and [Anthropic](/docs/integrations/platforms/anthropic), with the same image and tool in both cases. Let's first select an image and build a placeholder tool whose single argument must be one of the strings \"sunny\", \"cloudy\", or \"rainy\". We will then ask each model to describe the weather in the image."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "0d9fd81a-b7f0-445a-8e3d-cfc2d31fdd59",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from typing import Literal\n",
+    "\n",
+    "from langchain_core.tools import tool\n",
+    "\n",
+    "image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n",
+    "\n",
+    "\n",
+    "@tool\n",
+    "def weather_tool(weather: Literal[\"sunny\", \"cloudy\", \"rainy\"]) -> None:\n",
+    "    \"\"\"Describe the weather\"\"\"\n",
+    "    pass"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8656018e-c56d-47d2-b2be-71e87827f90a",
+   "metadata": {},
+   "source": [
+    "## OpenAI\n",
+    "\n",
+    "For OpenAI, we can pass the image URL directly in a content block of type \"image_url\":"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "a8819cf3-5ddc-44f0-889a-19ca7b7fe77e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'call_mRYL50MtHdeNuNIjSCm5UPmB'}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain_core.messages import HumanMessage\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "model = ChatOpenAI(model=\"gpt-4o\").bind_tools([weather_tool])\n",
+    "\n",
+    "message = HumanMessage(\n",
+    "    content=[\n",
+    "        {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
+    "        {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
+    "    ],\n",
+    ")\n",
+    "response = model.invoke([message])\n",
+    "print(response.tool_calls)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e5738224-1109-4bf8-8976-ff1570dd1d46",
+   "metadata": {},
+   "source": [
+    "Note that `response.tool_calls` contains the tool calls, with arguments already parsed, in LangChain's [standard format](/docs/how_to/tool_calling)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0cee63ff-e09f-4dd8-8323-912edbde94f6",
+   "metadata": {},
+   "source": [
+    "## Anthropic\n",
+    "\n",
+    "For Anthropic, we can format a base64-encoded image into a content block of type \"image\", as below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "d90c4590-71c8-42b1-99ff-03a9eca8082e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'toolu_016m9KfknJqx5fVRYk4tkF6s'}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "import base64\n",
+    "\n",
+    "import httpx\n",
+    "from langchain_anthropic import ChatAnthropic\n",
+    "\n",
+    "image_data = base64.b64encode(httpx.get(image_url).content).decode(\"utf-8\")\n",
+    "\n",
+    "model = ChatAnthropic(model=\"claude-3-sonnet-20240229\").bind_tools([weather_tool])\n",
+    "\n",
+    "message = HumanMessage(\n",
+    "    content=[\n",
+    "        {\"type\": \"text\", \"text\": \"describe the weather in this image\"},\n",
+    "        {\n",
+    "            \"type\": \"image\",\n",
+    "            \"source\": {\n",
+    "                \"type\": \"base64\",\n",
+    "                \"media_type\": \"image/jpeg\",\n",
+    "                \"data\": image_data,\n",
+    "            },\n",
+    "        },\n",
+    "    ],\n",
+    ")\n",
+    "response = model.invoke([message])\n",
+    "print(response.tool_calls)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
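
The two notebook examples differ only in how the image content block is shaped: a URL-based block of type "image_url" for OpenAI, and a base64 block of type "image" for Anthropic. As a minimal sketch of that difference, the snippet below builds both message contents with the standard library only. The helper names (`openai_image_block`, `anthropic_image_block`, `build_content`), the example URL, and the placeholder bytes are hypothetical and not part of this commit; no model is called.

```python
import base64


def openai_image_block(url: str) -> dict:
    """Hypothetical helper: URL-based content block of type "image_url"."""
    return {"type": "image_url", "image_url": {"url": url}}


def anthropic_image_block(raw: bytes, media_type: str = "image/jpeg") -> dict:
    """Hypothetical helper: base64-encoded content block of type "image"."""
    data = base64.b64encode(raw).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


def build_content(prompt: str, image_block: dict) -> list:
    """Pair a text block with an image block, as in the notebook's messages."""
    return [{"type": "text", "text": prompt}, image_block]


prompt = "describe the weather in this image"

# Placeholder bytes stand in for downloaded image data (assumption: not a real JPEG).
fake_jpeg = b"\xff\xd8\xff\xe0 placeholder"

openai_content = build_content(
    prompt, openai_image_block("https://example.com/photo.jpg")
)
anthropic_content = build_content(prompt, anthropic_image_block(fake_jpeg))

print(openai_content[1]["type"])
print(anthropic_content[1]["type"])
```

Either list could then be passed as `content` to `HumanMessage`, as the notebook cells do, with the tool-bound model invoked on that message.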