Gemini

2024/11/15
in Gemini, Document Processing
4 min read

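Eliminating Hallucinations with Gemini's Structured Outputs

In this post, we'll explore how to use Google's Gemini model with Instructor to generate accurate citations from PDFs. This approach ensures that answers are grounded in the actual content of the PDF, which reduces the risk of hallucinations.

For this example, we'll use the Nvidia 10k report, which you can download at this link.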

2024/11/11
in Gemini, Document Processing
3 min read

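Processing PDFs with Gemini's Structured Outputs

In this post, we'll explore how to use Google's Gemini model with Instructor to analyze the Gemini 1.5 Pro paper and extract a structured summary.

The Problem

Processing PDFs programmatically has always been painful. The typical approaches all have significant drawbacks:

PDF parsing libraries require complex rules and are prone to failure
OCR solutions are slow and error-prone
Specialized PDF APIs are expensive and require additional integration
LLM solutions often require complex document chunking and embedding pipelines

What if we could hand the PDF directly to an LLM and get structured data back? With Gemini's multimodal capabilities and Instructor's structured output handling, we can do exactly that.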

Quick Setup

First, install the required packages:

pip install "instructor[google-generativeai]"

Then, here's all the code you need:

import instructor
import google.generativeai as genai
from google.ai.generativelanguage_v1beta.types.file import File
from pydantic import BaseModel
import time

# Initialize the client
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    )
)


# Define your output structure
class Summary(BaseModel):
    summary: str


# Upload the PDF
file = genai.upload_file("path/to/your.pdf")

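# Wait for the file to finish processing; stop early if processing fails
while file.state != File.State.ACTIVE:
    if file.state == File.State.FAILED:
        raise RuntimeError(f"File processing failed: {file.name}")
    print(f"File is still processing, state: {file.state}")
    time.sleep(1)
    file = genai.get_file(file.name)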

print(f"File is now active, state: {file.state}")
print(file)

resp = client.chat.completions.create(
    messages=[
        {"role": "user", "content": ["Summarize the following file", file]},
    ],
    response_model=Summary,
)

print(resp.summary)

Raw results:

summary="Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. It can recall information amidst distractor context, and it can learn to translate a new language from a single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence."
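The polling loop in the snippet above has no upper bound on how long it waits. For longer-running jobs it may be worth wrapping the wait in a small helper with a timeout. The following is a minimal sketch using only the standard library; `wait_until_ready` and its parameters are hypothetical names introduced here for illustration and are not part of Instructor or the Gemini SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until_ready(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    has_failed: Callable[[T], bool],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
) -> T:
    """Poll fetch() until is_ready(result) is true.

    Raises RuntimeError if has_failed(result) is true, and TimeoutError
    if the deadline passes before the result becomes ready.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        current = fetch()
        if is_ready(current):
            return current
        if has_failed(current):
            raise RuntimeError(f"processing failed: {current!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        time.sleep(interval_s)


# Hypothetical usage with the objects from the snippet above:
#
# name = file.name
# file = wait_until_ready(
#     fetch=lambda: genai.get_file(name),
#     is_ready=lambda f: f.state == File.State.ACTIVE,
#     has_failed=lambda f: f.state == File.State.FAILED,
# )
```

Passing the checks in as callables keeps the helper independent of any particular SDK, so it can be exercised with plain values in a test.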

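Benefits

Combining Gemini with Instructor offers several key advantages over traditional PDF processing approaches:

Simple integration - Unlike traditional approaches that require complex document processing pipelines, chunking strategies, and embedding databases, you can process PDFs directly with a few lines of code. This significantly reduces development time and maintenance overhead.

Structured output - Instructor's Pydantic integration ensures you get exactly the data structure you need. The model's output is automatically validated and typed, which makes it easier to build reliable applications. If extraction fails, Instructor handles retries for you automatically and supports custom retry logic via tenacity.

Multimodal support - Gemini's multimodal capabilities mean the same approach works for a variety of file types. You can process images, video, and audio files in the same API request. See our multimodal processing guide to learn how we extract structured data from travel videos.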
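To make the "validated and retried" idea from the structured output point more concrete, here is a conceptual sketch of a validate-then-retry loop. It is not Instructor's implementation (as far as we know, Instructor exposes this behaviour through the `max_retries` argument of `create`). The names `extract_with_retries` and `call_model` are hypothetical, the model is replaced by a stub that returns canned replies, and Pydantic v2 is assumed.

```python
from typing import Callable, Optional

from pydantic import BaseModel, ValidationError


class Summary(BaseModel):
    summary: str


def extract_with_retries(
    call_model: Callable[[str], str],
    prompt: str,
    max_retries: int = 2,
) -> Summary:
    """Validate the model's JSON reply against Summary, re-asking on failure."""
    last_error: Optional[ValidationError] = None
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            # Pydantic v2: parse and validate the JSON string in one step
            return Summary.model_validate_json(raw)
        except ValidationError as err:
            last_error = err
            # Feed the validation error back so the next attempt can correct it
            prompt = (
                f"{prompt}\n\nYour previous reply was invalid:\n{err}\n"
                "Return corrected JSON."
            )
    raise RuntimeError(
        f"no valid reply after {max_retries + 1} attempts"
    ) from last_error


# Stub standing in for a real LLM call: the first reply uses the wrong key,
# the second one matches the Summary schema.
replies = iter(['{"text": "wrong key"}', '{"summary": "A short summary."}'])
result = extract_with_retries(lambda prompt: next(replies), "Summarize the file")
print(result.summary)
```

The point of the sketch is that the caller only ever sees a validated `Summary` instance or an exception, never a half-parsed string.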

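Conclusion

Working with PDFs doesn't have to be complicated.

By combining Gemini's multimodal capabilities with Instructor's structured output handling, we can turn complex document processing into simple, Pythonic code.

No more wrestling with parsing rules, managing embeddings, or building complex pipelines: just define your data model and let the LLM do the heavy lifting.

If you enjoyed this post, give instructor a try and see how much easier structured outputs make working with LLMs. Get started with Instructor today!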

2024/10/23
in Gemini, Multimodal
5 min read

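Structured Outputs with Multimodal Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyze travel videos and extract structured recommendations. This combination lets us process multimodal input (video) and generate structured output using Pydantic models. This post was written in collaboration with Kino.ai, a company that uses instructor for structured extraction from multimodal inputs to improve search for filmmakers.

Setting Up the Environment

First, let's set up the environment with the necessary libraries: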