In this post, we'll explore how to use Google's Gemini model with Instructor to analyze the Gemini 1.5 Pro paper and extract a structured summary.
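Processing PDFs programmatically has always been painful. The typical approaches all have significant drawbacks:
- PDF parsing libraries require complex rules and break easily
- OCR solutions are slow and error-prone
- Specialized PDF APIs are expensive and require additional integration
- LLM solutions often need complex document chunking and embedding pipelines
What if we could hand a PDF directly to an LLM and get structured data back? With Gemini's multimodal capabilities and Instructor's structured output handling, we can do exactly that.
First, install the required packages: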
pip install "instructor[google-generativeai]"
Then, here's all the code you need:
import instructor
import google.generativeai as genai
from google.ai.generativelanguage_v1beta.types.file import File
from pydantic import BaseModel
import time

# Initialize the client
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    )
)


# Define your output structure
class Summary(BaseModel):
    summary: str


# Upload the PDF
file = genai.upload_file("path/to/your.pdf")

# Wait for file to finish processing
while file.state != File.State.ACTIVE:
    time.sleep(1)
    file = genai.get_file(file.name)
    print(f"File is still uploading, state: {file.state}")

print(f"File is now active, state: {file.state}")
print(file)

resp = client.chat.completions.create(
    messages=[
        {"role": "user", "content": ["Summarize the following file", file]},
    ],
    response_model=Summary,
)

print(resp.summary)
Raw results:
summary="Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. It can recall information amidst distractor context, and it can learn to translate a new language from a single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence."
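The polling loop in the script above waits indefinitely and never checks whether processing failed. As a minimal sketch of a bounded wait, the helper below polls a state until it reads `"ACTIVE"`, and raises on `"FAILED"` or on a timeout. The names `wait_until_active` and `get_state` and the default timeout values are assumptions made for illustration; they are not part of Instructor or the Gemini SDK.

```python
import time
from typing import Callable


def wait_until_active(
    get_state: Callable[[], str],
    timeout: float = 120.0,
    poll_interval: float = 1.0,
) -> str:
    """Poll get_state() until it returns "ACTIVE".

    Raises RuntimeError if the state becomes "FAILED", and TimeoutError
    if the deadline passes while the file is still in another state.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state == "FAILED":
            raise RuntimeError("File processing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"File still in state {state!r} after {timeout}s")
        time.sleep(poll_interval)


# Hypothetical stand-in for the upload API: two pending polls, then ready.
fake_states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
print(wait_until_active(lambda: next(fake_states), poll_interval=0))
```

With the script above, `get_state` could be something like `lambda: genai.get_file(file.name).state.name`, which reads the state's name as a string on each poll.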
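The combination of Gemini and Instructor offers several key advantages over traditional PDF processing approaches:
Simple Integration - Unlike traditional approaches that require complex document processing pipelines, chunking strategies, and embedding databases, you can process PDFs directly with just a few lines of code. This cuts both development time and maintenance overhead.
Structured Output - Instructor's Pydantic integration ensures you get exactly the data structure you need. The model's outputs are automatically validated and typed, which makes it easier to build reliable applications. If extraction fails, Instructor handles retries for you automatically and supports custom retry logic using tenacity.
Multimodal Support - Gemini's multimodal capabilities mean the same approach works for a variety of file types. You can process images, videos, and audio files in the same API request. See our guide on multimodal processing to learn how we extract structured data from travel videos.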
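To illustrate the structured output point, here is a minimal sketch of a richer response model than the single-field `Summary` used earlier. The `PaperSummary` class, its fields, and the sample data are hypothetical and chosen only for illustration. The validation runs locally with Pydantic v2, so no API call is needed to see how a malformed response would be rejected.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class PaperSummary(BaseModel):
    """Hypothetical richer response model for a research paper."""

    title: str
    summary: str = Field(min_length=1, description="Short prose summary")
    key_findings: List[str] = Field(min_length=1, description="One result per item")


# Data shaped like a model response validates into a typed object...
ok = PaperSummary.model_validate(
    {
        "title": "Gemini 1.5",
        "summary": "A long-context multimodal model.",
        "key_findings": ["Near-perfect recall on long-context retrieval"],
    }
)
print(ok.key_findings[0])

# ...while a response with empty required content is rejected.
try:
    PaperSummary.model_validate(
        {"title": "Gemini 1.5", "summary": "", "key_findings": []}
    )
except ValidationError as exc:
    print(f"Rejected with {exc.error_count()} validation errors")
```

Passing `response_model=PaperSummary` to `client.chat.completions.create` would ask for this shape instead. Instructor's `max_retries` argument on the same call can be used to re-ask the model when validation fails.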
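Working with PDFs doesn't have to be complicated.
By combining Gemini's multimodal capabilities with Instructor's structured output handling, we can turn complex document processing into simple, Pythonic code.
No more wrestling with parsing rules, managing embeddings, or building complex pipelines: just define your data model and let the LLM do the heavy lifting.
If you enjoyed this post, give instructor a try today and see how much easier structured outputs make working with LLMs. Get started with Instructor today!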