2024

2024/12/26
in UV
5 min read

Migrating to uv

Why we migrated to uv

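We recently migrated from poetry to uv because we wanted to benefit from its many features, such as:

Simpler dependency management with automatic caching built in
Significantly faster CI/CD compared to poetry, especially when we use the caching functionality provided by the Astral team
A Cargo-style lockfile that makes it easier to adopt new PEP features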

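The migration took us around 1-2 days, and we're happy with the results. On average, our CI/CD jobs got a huge speedup.

Here are some job timings that I took from our CI/CD runs.

Overall, once we implemented caching for the individual uv GitHub Actions, our jobs ran roughly 3x faster, which corresponds to about 67% less time per job.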
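The two figures describe the same change: a job that runs 3x faster takes one third of its original time, which is a reduction of about 67%. A quick sanity check of that arithmetic, using a hypothetical 90-second baseline that is not taken from our measurements:

```python
def time_reduction(speedup: float) -> float:
    """Fraction of wall-clock time saved for a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1 - 1 / speedup


baseline_seconds = 90.0  # hypothetical job duration, for illustration only
speedup = 3.0
new_seconds = baseline_seconds / speedup

print(f"{new_seconds:.0f}s after, {time_reduction(speedup):.0%} less time")
# prints "30s after, 67% less time"
```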

2024/12/11
in OpenAI, Multimodal
5 min read

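Extracting Metadata from Images using Structured Extraction

Multimodal language models like gpt-4o excel at processing multimodal content, which lets us extract rich, structured metadata from images.

This is particularly valuable in areas like fashion, where we can use these capabilities to understand a user's style preferences from images and even videos. In this post, we'll show how to use instructor to map images to a given product taxonomy so that we can recommend similar products to users.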
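To give a rough idea of what "mapping to a product taxonomy" means in practice, here is a minimal standard-library sketch. The `Category` values, the `ImageMetadata` fields, and the `parse_metadata` helper are all hypothetical and only illustrate the shape of the data; with instructor, the equivalent structure would typically be a Pydantic model passed as `response_model`, and the `raw` dict below stands in for what a model might return.

```python
from dataclasses import dataclass
from enum import Enum


class Category(str, Enum):
    # Hypothetical taxonomy; a real catalogue would define its own values.
    TOPS = "tops"
    BOTTOMS = "bottoms"
    FOOTWEAR = "footwear"


@dataclass(frozen=True)
class ImageMetadata:
    category: Category
    colors: tuple[str, ...]
    style_tags: tuple[str, ...]


def parse_metadata(raw: dict) -> ImageMetadata:
    """Validate a raw extraction against the taxonomy."""
    # Enum lookup raises ValueError for categories outside the taxonomy.
    category = Category(raw["category"])
    return ImageMetadata(
        category=category,
        colors=tuple(c.lower() for c in raw.get("colors", [])),
        style_tags=tuple(raw.get("style_tags", [])),
    )


# Made-up extraction result for a single image.
raw = {"category": "footwear", "colors": ["White"], "style_tags": ["minimalist"]}
item = parse_metadata(raw)
print(item.category.value, item.colors)
```

Constraining the category to a closed set is what makes the result usable for recommendations: two images that land in the same category can be compared directly on their remaining fields.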

2024/12/10
in OpenAI
5 min read

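Generating Consistent Stories with GPT-4o

Language models struggle to generate consistent graphs that have a large number of nodes. This is often because the graph itself is too big for the model to handle, so the model produces inconsistent graphs with problems such as invalid and disconnected nodes.

In this article, we'll look at how to get around this limitation by using a two-phase approach to generate complex DAGs with gpt-4o, working through a simple example of generating a "Choose Your Own Adventure" story.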
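The failure modes named above (invalid nodes, disconnected nodes) can be checked mechanically. The sketch below is not the two-phase generation itself; it is one way to validate a generated story graph, assuming the story is represented as a dict that maps each node id to the ids of its choices. The function name and that representation are assumptions made for illustration.

```python
from collections import deque


def validate_story_graph(graph: dict[str, list[str]], start: str) -> list[str]:
    """Return a list of problems: dangling choices, unreachable nodes, cycles."""
    problems = []

    # Choices that point at nodes that were never generated.
    for node, choices in graph.items():
        for target in choices:
            if target not in graph:
                problems.append(f"{node} -> {target}: target does not exist")

    # Breadth-first search to find nodes that cannot be reached from the start.
    seen = {start}
    queue = deque([start])
    while queue:
        for target in graph.get(queue.popleft(), []):
            if target in graph and target not in seen:
                seen.add(target)
                queue.append(target)
    for node in graph:
        if node not in seen:
            problems.append(f"{node}: unreachable from {start}")

    # Kahn's algorithm: if some nodes can never be removed, there is a cycle.
    indegree = {node: 0 for node in graph}
    for choices in graph.values():
        for target in choices:
            if target in indegree:
                indegree[target] += 1
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for target in graph[node]:
            if target in indegree:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    if removed != len(graph):
        problems.append("graph contains a cycle")

    return problems


story = {
    "intro": ["forest", "cave"],
    "forest": ["ending"],
    "cave": ["ending", "dragon"],  # "dragon" was never generated
    "ending": [],
    "orphan": ["ending"],  # nothing leads here
}
for problem in validate_story_graph(story, start="intro"):
    print(problem)
```

A check like this can run after generation, so that an inconsistent graph is rejected or regenerated instead of being shown to a reader.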

2024/11/21
in Data Analysis, Structured Outputs
4 min read

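Using Structured Outputs to Convert Messy Tables into Tidy Data

Why is this a problem?

Messy data exports are a common problem. Whether it's multiple headers in a table, implicit relationships that make analysis painful, or even merged cells, using instructor with structured outputs makes it easy to convert messy tables into tidy data, even if all you have is an image of the table, as we'll see below.

Let's take the following table as an example. Its empty cells and implicit repetition make analysis unnecessarily difficult. If we were to use it for data analysis, cleaning it up by hand would be a nightmare.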
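To make "implicit repetition" concrete, the sketch below shows the kind of manual cleanup that such a table needs: a blank cell means "same as the row above", so the value has to be copied down before each row can stand on its own. The rows and the `fill_down` helper are made up for illustration and are not the table from this post; this is the hand-written step that a structured-output approach is meant to spare us.

```python
def fill_down(rows: list[dict], columns: list[str]) -> list[dict]:
    """Copy the last non-empty value into blank cells of the given columns."""
    last_seen = {}
    tidy = []
    for row in rows:
        filled = dict(row)  # copy so the input rows are left untouched
        for col in columns:
            if filled.get(col) in ("", None):
                filled[col] = last_seen.get(col)
            else:
                last_seen[col] = filled[col]
        tidy.append(filled)
    return tidy


# Made-up rows: the region is only written on the first row of each group.
messy = [
    {"region": "North", "quarter": "Q1", "sales": 120},
    {"region": "", "quarter": "Q2", "sales": 135},
    {"region": "South", "quarter": "Q1", "sales": 90},
    {"region": "", "quarter": "Q2", "sales": 110},
]
tidy = fill_down(messy, ["region"])
print([row["region"] for row in tidy])
# prints ['North', 'North', 'South', 'South']
```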

2024/11/19
in Writer SDK
3 min read

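Structured Outputs with Writer now supported

We're excited to announce that instructor now supports Writer's enterprise-grade LLMs, including their latest Palmyra X 004 model. This integration makes it possible to use Writer's powerful language models for structured outputs and enterprise AI workflows.

Getting Started

First, make sure that you've signed up for an account on Writer and obtained an API key using this quickstart guide. Once you've done so, install instructor with Writer support by running pip install instructor[writer] in your terminal.

Make sure to set the WRITER_API_KEY environment variable to your Writer API key, or pass it as an argument to the Writer constructor.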

2024/11/15
in Gemini, Document Processing
4 min read

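Eliminating Hallucinations with Structured Outputs using Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to generate accurate citations from PDFs. This approach ensures that answers are grounded in the actual content of the PDF, which reduces the risk of hallucinations.

We'll be using the Nvidia 10k report as an example, which you can download at this link.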
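One way to see why citations reduce the risk of hallucination is that a quoted passage can be checked against the source. The sketch below assumes each citation carries a verbatim quote and a page number, and that the PDF text is available as one string per page; the `Citation` fields, the helper names, and the sample text are all hypothetical and do not come from the report.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    quote: str  # text the model claims to be quoting
    page: int   # 1-based page number the quote is attributed to


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks don't cause false misses."""
    return " ".join(text.split()).lower()


def unsupported_citations(citations: list[Citation], pages: list[str]) -> list[Citation]:
    """Return citations whose quote does not appear on the page they cite."""
    missing = []
    for citation in citations:
        in_range = 1 <= citation.page <= len(pages)
        if not in_range or normalize(citation.quote) not in normalize(pages[citation.page - 1]):
            missing.append(citation)
    return missing


# Made-up page text and citations, for illustration only.
pages = ["Revenue grew year over year.", "Risk factors include   supply\nconstraints."]
citations = [
    Citation("revenue grew", 1),
    Citation("supply constraints", 2),
    Citation("profits doubled", 2),  # not present in the source
]
print(unsupported_citations(citations, pages))
```

An answer whose citations all pass a check like this is at least anchored to text that exists in the document, which is the property the structured citation format is meant to provide.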

2024/11/11
in Gemini, Document Processing
3 min read

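PDF Processing with Structured Outputs with Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyse the Gemini 1.5 Pro paper and extract a structured summary.

The Problem

Processing PDFs programmatically has always been painful. The typical approaches all have significant drawbacks:

PDF parsing libraries require complex rules and are error-prone
OCR solutions are slow and error-prone
Specialized PDF APIs are expensive and require additional integration
LLM solutions often require complex document chunking and embedding pipelines

What if we could just hand a PDF to an LLM and get structured data back? With Gemini's multimodal capabilities and Instructor's structured output handling, we can do exactly that.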

Quick Setup

First, install the required packages:

pip install "instructor[google-generativeai]"

Then, here's all the code you need:

import instructor
import google.generativeai as genai
from google.ai.generativelanguage_v1beta.types.file import File
from pydantic import BaseModel
import time

# Initialize the client
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    )
)


# Define your output structure
class Summary(BaseModel):
    summary: str


# Upload the PDF
file = genai.upload_file("path/to/your.pdf")

# Wait for the file to finish processing, polling its state once per second
while file.state != File.State.ACTIVE:
    print(f"File is still uploading, state: {file.state}")
    time.sleep(1)
    file = genai.get_file(file.name)

print(f"File is now active, state: {file.state}")
print(file)

resp = client.chat.completions.create(
    messages=[
        {"role": "user", "content": ["Summarize the following file", file]},
    ],
    response_model=Summary,
)

print(resp.summary)

Raw results

summary="Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. It can recall information amidst distractor context, and it can learn to translate a new language from a single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence."