Multimodal¶
We've provided a few different sample files for you to test these new features with. All of the examples below use these files.
- (image) : An image of some blueberry plants : image.jpg
- (audio) : A recording of the original Gettysburg Address : gettysburg.wav
- (PDF) : A sample PDF file containing a fake invoice : invoice.pdf
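All three sample files live in the instructor repository's test assets, so the URL-based examples below work as-is. If you'd like local copies for the from_path examples, a short download sketch using requests (file names chosen to match the examples on this page) might look like this:

import requests

# URLs for the sample assets used throughout this page
assets = {
    "image.jpg": "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg",
    "gettysburg.wav": "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/gettysburg.wav",
    "invoice.pdf": "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf",
}

for filename, url in assets.items():
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)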
Instructor provides a unified, provider-agnostic interface for working with multimodal inputs such as images, PDFs, and audio files.
With Instructor's multimodal objects, you can easily load media from URLs, local files, or base64 strings using a consistent API that works across different AI providers (OpenAI, Anthropic, Mistral, etc.).
Instructor handles all the provider-specific formatting requirements behind the scenes, keeping your code clean and future-proof as provider APIs evolve. Let's look at how to use the Image, Audio, and PDF classes.
Image¶
This class represents an image that can be loaded from a URL or file path. It provides a set of methods to create Image instances from different sources (e.g., URLs, paths, and base64 strings). The table below shows which methods each provider supports.
Method | OpenAI | Anthropic | Google GenAI |
---|---|---|---|
from_url() | ✅ | ✅ | ✅ |
from_path() | ✅ | ✅ | ✅ |
from_base64() | ✅ | ✅ | ✅ |
autodetect() | ✅ | ✅ | ✅ |
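Each of these constructors returns an Image object that can be dropped straight into a message's content list. Below is a minimal sketch of the loading options side by side; it assumes image.jpg has been downloaded locally as shown above, and from_base64 works the same way with a base64-encoded string:

from instructor.multimodal import Image

# Load from a public URL
image_from_url = Image.from_url(
    "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"
)

# Load from a local file (assumes ./image.jpg exists)
image_from_path = Image.from_path("./image.jpg")

# Let Instructor detect whether the input is a URL, a path, or a base64 string
image_auto = Image.autodetect("./image.jpg")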
We also support Anthropic prompt caching for images via the ImageWithCacheControl object.
Usage¶
By using the Image class, we can abstract away the differences between formats and let you work with a consistent interface.
You can create an Image instance from a URL or file path using the from_url or from_path methods. The Image class will automatically convert the image to a base64-encoded string and include it in the API request.
import instructor
from instructor.multimodal import Image
import openai
from pydantic import BaseModel
class ImageDescription(BaseModel):
description: str
items: list[str]
# Use our sample image provided above.
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"
client = instructor.from_openai(openai.OpenAI())
response = client.chat.completions.create(
model="gpt-4o-mini",
response_model=ImageDescription,
messages=[
{
"role": "user",
"content": [
"What is in this image?",
Image.from_url(url),
],
}
],
)
print(response)
# > description='A bush with numerous clusters of blueberries surrounded by green leaves, under a cloudy sky.' items=['blueberries', 'green leaves', 'cloudy sky']
We also provide an autodetect_images keyword argument. When you set it to True, you can pass URLs or file paths as plain strings. You can see an example below.
import instructor
from instructor.multimodal import Image
import openai
from pydantic import BaseModel
class ImageDescription(BaseModel):
description: str
items: list[str]
# Use our sample image provided above.
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"
client = instructor.from_openai(openai.OpenAI())
response = client.chat.completions.create(
model="gpt-4o-mini",
response_model=ImageDescription,
autodetect_images=True, # Set this to True
messages=[
{
"role": "user",
"content": ["What is in this image?", url],
}
],
)
print(response)
# > description='A bush with numerous clusters of blueberries surrounded by green leaves, under a cloudy sky.' items=['blueberries', 'green leaves', 'cloudy sky']
If you'd like to use Anthropic prompt caching with images, we provide the ImageWithCacheControl object to do so. Simply use the from_image_params method and you can take advantage of Anthropic's prompt caching.
import instructor
from instructor.multimodal import ImageWithCacheControl
import anthropic
from pydantic import BaseModel
class ImageDescription(BaseModel):
description: str
items: list[str]
# Use our sample image provided above.
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"
client = instructor.from_anthropic(anthropic.Anthropic())
response, completion = client.chat.completions.create_with_completion(
model="claude-3-5-sonnet-20240620",
response_model=ImageDescription,
autodetect_images=True, # Set this to True
messages=[
{
"role": "user",
"content": [
"What is in this image?",
ImageWithCacheControl.from_image_params(
{
"source": url,
"cache_control": {
"type": "ephemeral",
},
}
),
],
}
],
max_tokens=1000,
)
print(response)
# > description='A bush with numerous clusters of blueberries surrounded by green leaves, under a cloudy sky.' items=['blueberries', 'green leaves', 'cloudy sky']
print(completion.usage.cache_creation_input_tokens)
# > 1820
By leveraging Instructor's multimodal capabilities, you can focus on building your application logic instead of wrestling with the intricacies of each provider's image-handling format. This not only saves development time, it also makes your code more maintainable and better able to adapt to future changes in provider APIs.
Audio¶
Note: only OpenAI and Gemini currently support audio files. For Gemini, we pass in the raw bytes for this feature. If you'd like to use the Files API instead, we support that too; read more about how to do so here.
Similar to the Image class, we provide methods to create Audio instances.
Method | OpenAI | Google GenAI |
---|---|---|
from_url() | ✅ | ✅ |
from_path() | ✅ | ✅ |
The Audio class represents an audio file that can be loaded from a URL or file path. It provides the from_path and from_url methods for creating Audio instances, and it will automatically convert the audio into the correct format and include it in the API request.
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
# Initialize the client
client = instructor.from_openai(OpenAI())
# Define our response model
class AudioDescription(BaseModel):
summary: str
transcript: str
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/gettysburg.wav"
# Make the API call with the audio file
resp = client.chat.completions.create(
model="gpt-4o-audio-preview",
response_model=AudioDescription,
modalities=["text"],
audio={"voice": "alloy", "format": "wav"},
messages=[
{
"role": "user",
"content": [
"Extract the following information from the audio:",
Audio.from_url(url),
],
},
],
)
print(resp)
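For Gemini, the same Audio object can be loaded from a local file with from_path and passed to a google-genai backed client. Here is a minimal sketch under a few assumptions: gettysburg.wav has been downloaded locally (see the start of this page), the gemini-2.0-flash model is used as in the PDF examples below, and your Google GenAI credentials are already configured.

from google.genai import Client
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio

# Initialize the Gemini-backed client
client = instructor.from_genai(Client())

# Define our response model
class AudioDescription(BaseModel):
    summary: str
    transcript: str

# Make the API call with a locally stored audio file
resp = client.chat.completions.create(
    model="gemini-2.0-flash",
    response_model=AudioDescription,
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio:",
                Audio.from_path("./gettysburg.wav"),
            ],
        },
    ],
)
print(resp)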
PDF¶
The PDF class represents a PDF file that can be loaded from a URL or file path. It provides methods to create PDF instances and currently supports the OpenAI, Mistral, Google GenAI, and Anthropic client integrations.
Method | OpenAI | Anthropic | Google GenAI | Mistral |
---|---|---|---|---|
from_url() | ✅ | ✅ | ✅ | ✅ |
from_path() | ✅ | ✅ | ✅ | ❎ |
from_base64() | ✅ | ✅ | ✅ | ❎ |
autodetect() | ✅ | ✅ | ✅ | ✅ |
For Gemini, we also provide two extra methods that make it easy to work with the google-genai files API; you can access them on the PDFWithGenaiFile object.
For Anthropic, you can enable caching with the PDFWithCacheControl object. Note that this object comes with caching configured by default for convenience.
We've provided examples of how to use all three of these object classes below.
Usage¶
from openai import OpenAI
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDF
# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_openai(OpenAI())
# Create a model for analyzing PDFs
class Invoice(BaseModel):
total: float
items: list[str]
# Load and analyze a PDF
response = client.chat.completions.create(
model="gpt-4o-mini",
response_model=Invoice,
messages=[
{
"role": "user",
"content": [
"Analyze this document",
PDF.from_url(url),
],
}
],
)
print(response)
# > Total = 220, items = ['English Tea', 'Tofu']
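The from_url call above can be swapped for any of the other constructors from the table. Here is a minimal sketch of the alternatives, assuming invoice.pdf has been downloaded locally as shown at the start of this page; from_base64 works similarly with an encoded string:

from instructor.multimodal import PDF

# Load from a local file (assumes ./invoice.pdf exists)
pdf_from_path = PDF.from_path("./invoice.pdf")

# Let Instructor detect whether the input is a URL, a path, or base64 data
pdf_auto = PDF.autodetect("./invoice.pdf")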
Caching¶
If you'd like to cache a PDF for Anthropic, we provide the PDFWithCacheControl class, which comes with caching configured by default.
from anthropic import Anthropic
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDFWithCacheControl
# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_anthropic(Anthropic())
# Create a model for analyzing PDFs
class Invoice(BaseModel):
total: float
items: list[str]
# Load and analyze a PDF
response, completion = client.chat.completions.create_with_completion(
model="claude-3-5-sonnet-20240620",
response_model=Invoice,
messages=[
{
"role": "user",
"content": [
"Analyze this document",
PDFWithCacheControl.from_url(url),
],
}
],
max_tokens=1000,
)
print(response)
# > Total = 220, items = ['English Tea', 'Tofu']
print(completion.usage.cache_creation_input_tokens)
# > 2091
Using Files¶
We also provide convenient wrappers around the Files API, allowing you to use files that have already been uploaded and to block the main thread until an upload has completed.
In the example below, we download the sample PDF and then upload it using the Files API provided by the google.genai SDK.
from google.genai import Client
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDFWithGenaiFile
import requests
# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_genai(Client())
with requests.get(url) as response:
pdf_data = response.content
with open("./invoice.pdf", "wb") as f:
f.write(pdf_data)
# Create a model for analyzing PDFs
class Invoice(BaseModel):
total: float
items: list[str]
# Load and analyze a PDF
response = client.chat.completions.create(
model="gemini-2.0-flash",
response_model=Invoice,
messages=[
{
"role": "user",
"content": [
"Analyze this document",
PDFWithGenaiFile.from_new_genai_file(
file_path="./invoice.pdf",
retry_delay=10,
max_retries=20,
),
],
}
],
)
print(response)
# > Total = 220, items = ['English Tea', 'Tofu']
If you've already uploaded the file ahead of time, we support that too. Just provide the file name, as shown below.
from google.genai import Client
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDFWithGenaiFile
import requests
# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_genai(Client())
with requests.get(url) as response:
pdf_data = response.content
with open("./invoice.pdf", "wb") as f:
f.write(pdf_data)
file = client.files.upload(
file="invoice.pdf",
)
# Create a model for analyzing PDFs
class Invoice(BaseModel):
total: float
items: list[str]
# Load and analyze a PDF
response = client.chat.completions.create(
model="gemini-2.0-flash",
response_model=Invoice,
messages=[
{
"role": "user",
"content": [
"Analyze this document",
PDFWithGenaiFile.from_existing_genai_file(file_name=file.name),
],
}
],
)
print(response)
# > Total = 220, items = ['English Tea', 'Tofu']
This gives you more fine-grained control over how your files are uploaded, and potentially lets you work with multiple file uploads at the same time.
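For example, here is a rough sketch of uploading several local PDFs up front and then analyzing each one by name, following the same pattern as the example above (the second file name is purely illustrative):

from google.genai import Client
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDFWithGenaiFile

# Set up the client
client = instructor.from_genai(Client())

# Create a model for analyzing PDFs
class Invoice(BaseModel):
    total: float
    items: list[str]

# Upload every local PDF up front (file names here are illustrative)
local_pdfs = ["./invoice.pdf", "./invoice_2.pdf"]
uploaded = [client.files.upload(file=path) for path in local_pdfs]

# Analyze each uploaded file by referencing its file name
for genai_file in uploaded:
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        response_model=Invoice,
        messages=[
            {
                "role": "user",
                "content": [
                    "Analyze this document",
                    PDFWithGenaiFile.from_existing_genai_file(file_name=genai_file.name),
                ],
            }
        ],
    )
    print(response)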