Document Processing

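2024/11/15
In Gemini, Document Processing
4 min read

Eliminating Hallucinations with Structured Outputs using Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to generate accurate citations from PDFs. This approach grounds answers in the actual content of the PDF, which reduces the risk of hallucinations.

We'll use Nvidia's 10-K report as the example; you can download it at this link.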
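
To make the idea of grounding concrete, here is a minimal sketch of a citation check, using only the standard library. It assumes the model returns each citation as a verbatim quote plus a page number, and that the page text is already available; the `Citation` class, the `ungrounded` helper, and the sample page text are all hypothetical and are not taken from the post or from Instructor. In an Instructor setup, a similar check could plausibly live in a Pydantic validator.

```python
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so PDF line breaks do not block a match."""
    return " ".join(text.split()).lower()


@dataclass
class Citation:
    quote: str  # text the model claims to quote verbatim (hypothetical field)
    page: int   # 1-based page number the quote is attributed to


def ungrounded(citations: list[Citation], pages: dict[int, str]) -> list[Citation]:
    """Return the citations whose quote does not appear on the cited page."""
    missing = []
    for c in citations:
        page_text = normalize(pages.get(c.page, ""))
        # An empty quote proves nothing, so treat it as ungrounded too.
        if not c.quote.strip() or normalize(c.quote) not in page_text:
            missing.append(c)
    return missing


# Placeholder page text standing in for text extracted from a PDF.
pages = {1: "The company reported higher data center revenue\nduring the fiscal year."}
citations = [
    Citation(quote="higher data center revenue during the fiscal year", page=1),
    Citation(quote="revenue declined sharply", page=1),
]
bad = ungrounded(citations, pages)
print([c.quote for c in bad])
```

Any citation returned by `ungrounded` is a candidate hallucination: the answer can be rejected, or the model can be asked again with the failing quotes listed.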

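2024/11/11
In Gemini, Document Processing
3 min read

PDF Processing with Structured Outputs using Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyze the Gemini 1.5 Pro paper and extract a structured summary.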

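The Problem

Processing PDFs programmatically has always been painful. The typical approaches all have significant drawbacks:

PDF parsing libraries require complex rules and break easily
OCR solutions are slow and error-prone
Specialized PDF APIs are expensive and require additional integration
LLM solutions often require complex document chunking and embedding pipelines

What if we could hand the PDF directly to the LLM and get structured data back? With Gemini's multimodal capabilities and Instructor's structured output handling, we can do exactly that.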

Quick Setup

First, install the required package:

pip install "instructor[google-generativeai]"

Then, here is all the code you need:

import instructor
import google.generativeai as genai
from google.ai.generativelanguage_v1beta.types.file import File
from pydantic import BaseModel
import time

# Initialize the client
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    )
)


# Define your output structure
class Summary(BaseModel):
    summary: str


# Upload the PDF
file = genai.upload_file("path/to/your.pdf")

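# Wait for the file to finish processing; stop if the service reports a failure
while file.state != File.State.ACTIVE:
    if file.state == File.State.FAILED:
        raise RuntimeError(f"File processing failed: {file.name}")
    print(f"File is still processing, state: {file.state}")
    time.sleep(1)
    file = genai.get_file(file.name)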

print(f"File is now active, state: {file.state}")
print(file)

resp = client.chat.completions.create(
    messages=[
        {"role": "user", "content": ["Summarize the following file", file]},
    ],
    response_model=Summary,
)

print(resp.summary)

Raw result:

summary="Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. It can recall information amidst distractor context, and it can learn to translate a new language from a single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence."
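
The polling loop in the example above waits indefinitely for the uploaded file to become active. As a hedged sketch of a more defensive variant, the helper below adds a timeout and a failure check. `wait_for_state` and its parameters are hypothetical names introduced for illustration, and the state values are plain strings rather than the SDK's enum, so the snippet runs without network access.

```python
import time
from typing import Callable


def wait_for_state(
    fetch_state: Callable[[], str],
    ready: str = "ACTIVE",
    failed: str = "FAILED",
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state() until it returns `ready`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state == ready:
            return state
        if state == failed:
            raise RuntimeError("file processing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"file still in state {state!r} after {timeout_s}s")
        sleep(poll_s)


# Simulated state sequence standing in for repeated genai.get_file(...) calls.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
result = wait_for_state(lambda: next(states), sleep=lambda _s: None)
print(result)  # ACTIVE
```

With the real client, `fetch_state` would be a callable that re-fetches the file through `genai.get_file` and returns its state name, for example something like `lambda: genai.get_file(file.name).state.name`, assuming the state enum exposes `.name`.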

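Benefits

Combining Gemini with Instructor offers several key advantages over traditional PDF processing approaches:

Simple integration - Unlike traditional approaches that require complex document processing pipelines, chunking strategies, and embedding databases, you can process PDFs directly with a few lines of code. This significantly reduces development time and maintenance overhead.

Structured output - Instructor's Pydantic integration ensures you get exactly the data structure you need. The model's output is automatically validated and typed, which makes it easier to build reliable applications. If extraction fails, Instructor automatically handles retries for you, and it supports custom retry logic using tenacity.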
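
The validate-then-retry idea can be sketched with the standard library alone. This is not Instructor's implementation: `parse_summary`, `extract_with_retries`, and `fake_model` are hypothetical stand-ins, and a plain dataclass plays the role of the Pydantic model. The sketch only shows the shape of the loop: validate the reply, and on failure ask again with the validation error attached.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Summary:
    summary: str


def parse_summary(raw: str) -> Summary:
    """Parse and validate raw model output; raise ValueError if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("summary"), str):
        raise ValueError("expected an object with a string 'summary' field")
    if not data["summary"].strip():
        raise ValueError("'summary' must not be empty")
    return Summary(summary=data["summary"])


def extract_with_retries(
    call_model: Callable[[str], str], prompt: str, max_retries: int = 2
) -> Summary:
    """Call the model, validate the reply, and re-ask with the error on failure."""
    message = prompt
    for attempt in range(max_retries + 1):
        raw = call_model(message)
        try:
            return parse_summary(raw)
        except ValueError as err:
            if attempt == max_retries:
                raise
            # Feed the validation error back so the next attempt can correct it.
            message = f"{prompt}\n\nYour previous reply was rejected: {err}"
    raise ValueError("max_retries must be non-negative")


# A fake model that fails twice before returning a valid reply.
replies = iter(["not json", '{"summary": ""}', '{"summary": "A short summary."}'])
seen: list[str] = []


def fake_model(message: str) -> str:
    seen.append(message)
    return next(replies)


result = extract_with_retries(fake_model, "Summarize the file")
print(result.summary)  # A short summary.
print(len(seen))  # 3
```

Passing the error text back with the retry gives the model a chance to fix the specific problem instead of repeating it.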

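Multimodal support - Gemini's multimodal capabilities mean the same approach works for a variety of file types. You can process images, video, and audio files in the same API request. See our multimodal processing guide for how we extract structured data from travel videos.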

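Conclusion

Processing PDFs doesn't have to be complicated.

By combining Gemini's multimodal capabilities with Instructor's structured output handling, we can turn complex document processing into simple, Pythonic code.

No more wrestling with parsing rules, managing embeddings, or building complex pipelines: just define your data model and let the LLM do the heavy lifting.

If you enjoyed this post, give instructor a try and see how structured outputs can make working with LLMs easier. Get started with Instructor today!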