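Bulk Generation of Synthetic Data

This tutorial shows how to use instructor with OpenAI's Batch API to generate synthetic data at scale. In this example, we use the ms-marco dataset to generate synthetic questions for evaluating RAG retrieval.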

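Why use the Batch API?

There are a few reasons you might want to use the Batch API:

Batch jobs are 50% cheaper than running the same inference on demand (see OpenAI's pricing page)
Batch jobs have higher rate limits than ordinary API calls
Batch jobs support both standard and fine-tuned models

This makes them a good fit for tasks that involve large amounts of data and are not time-sensitive.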

Getting Started

First, let's see how to generate a question-answer pair with Instructor using an ordinary OpenAI function call.

from pydantic import BaseModel, Field
from openai import OpenAI
from instructor import from_openai

client = from_openai(OpenAI())


class QuestionAnswerPair(BaseModel):
    """
    This model represents a pair of a question generated from a text chunk, its corresponding answer,
    and the chain of thought leading to the answer. The chain of thought provides insight into how the answer
    was derived from the question.
    """

    chain_of_thought: str = Field(
        description="The reasoning process leading to the answer."
    )
    question: str = Field(description="The generated question from the text chunk.")
    answer: str = Field(description="The answer to the generated question.")


def generate_question(chunk: str) -> QuestionAnswerPair:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a world class AI that excels at generating hypothetical search queries. You're about to be given a text snippet and asked to generate a search query which is specific to the specific text chunk that you'll be given. Make sure to use information from the text chunk.",
            },
            {"role": "user", "content": f"Here is the text chunk: {chunk}"},
        ],
        response_model=QuestionAnswerPair,
    )


text_chunk = """
The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the Commonwealth Bank. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.
"""
print(generate_question(text_chunk).model_dump_json(indent=2))
"""
{
  "chain_of_thought": "The text discusses the formation of the Reserve Bank of Australia (RBA) and provides key details about its establishment date, the removal of central banking functions from the Commonwealth Bank, its asset worth, and its employee distribution. By focusing on these details, a search query can be framed around the establishment date and purpose of the RBA.",
  "question": "When was the Reserve Bank of Australia established and what are its main functions?",
  "answer": "The Reserve Bank of Australia was established on 14 January 1960 as Australia's central bank and banknote issuing authority."
}
"""

As the number of chunks we want synthetic questions for grows, the cost grows proportionally.
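
To make the scaling concrete, here is a minimal sketch of the arithmetic. The function `estimate_cost`, the token counts, and the per-million-token prices are all hypothetical placeholders chosen for illustration; only the 50% batch discount comes from the text above, so substitute real figures from OpenAI's pricing page.

```python
def estimate_cost(
    n_chunks,
    tokens_in_per_chunk,
    tokens_out_per_chunk,
    usd_per_1m_in,
    usd_per_1m_out,
    batch_discount=0.5,
):
    """Return (on_demand_usd, batch_usd) for n_chunks requests.

    All prices are caller-supplied assumptions; batch_discount=0.5
    reflects the 50% saving mentioned earlier.
    """
    per_request = (
        tokens_in_per_chunk * usd_per_1m_in
        + tokens_out_per_chunk * usd_per_1m_out
    ) / 1_000_000
    on_demand = n_chunks * per_request
    return on_demand, on_demand * (1 - batch_discount)


# Illustrative numbers only: 1,000 chunks, 400 input and 150 output tokens each.
on_demand, batch = estimate_cost(
    1_000, 400, 150, usd_per_1m_in=1.0, usd_per_1m_out=2.0
)
print(f"on-demand: ${on_demand:.2f}, batch: ${batch:.2f}")
```

Because the cost is linear in the number of chunks, the absolute saving from the discount grows with the size of the dataset.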

Let's see how to use the BatchJob object to create a .jsonl file that is compatible with the Batch API.

from datasets import load_dataset
from instructor.batch import BatchJob
from pydantic import BaseModel, Field

dataset = load_dataset("ms_marco", "v1.1", split="train", streaming=True).take(200)


def get_messages(dataset):  # (1)!
    for row in dataset:
        for passage in row['passages']['passage_text']:
            yield [
                {
                    "role": "system",
                    "content": "You are a world class AI that excels at generating hypothetical search queries. You're about to be given a text snippet and asked to generate a search query which is specific to the specific text chunk that you'll be given. Make sure to use information from the text chunk.",
                },
                {"role": "user", "content": f"Here is the text chunk: {passage}"},
            ]


class QuestionAnswerPair(BaseModel):
    """
    This model represents a pair of a question generated from a text chunk, its corresponding answer,
    and the chain of thought leading to the answer. The chain of thought provides insight into how the answer
    was derived from the question.
    """

    chain_of_thought: str = Field(
        description="The reasoning process leading to the answer."
    )
    question: str = Field(description="The generated question from the text chunk.")
    answer: str = Field(description="The answer to the generated question.")


BatchJob.create_from_messages(
    messages_batch=get_messages(dataset),
    model="gpt-4o",
    file_path="./test.jsonl",
    response_model=QuestionAnswerPair,
)  # (2)!

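(1) We first define a generator that yields lists of messages, the same messages we would normally pass to an OpenAI API call.
(2) We then use the create_from_messages class method to specify the model and response_model we want. instructor generates the OpenAI schema behind the scenes and writes the output to the file path you specify.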
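
For orientation, each line of a Batch API input file is a standalone JSON request with a `custom_id`, an HTTP `method`, a `url`, and a `body`. The sketch below builds one such line by hand using only the standard library. The helper `build_batch_line`, the `request-0` id, and the hand-written JSON schema are illustrative assumptions; the exact body that instructor writes may differ.

```python
import json


def build_batch_line(custom_id, model, messages, schema_name, parameters):
    """Serialize one chat-completions request as a single JSONL line."""
    request = {
        "custom_id": custom_id,  # lets you match outputs back to inputs
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": messages,
            "tools": [
                {
                    "type": "function",
                    "function": {"name": schema_name, "parameters": parameters},
                }
            ],
            # Force the model to answer through the function call.
            "tool_choice": {"type": "function", "function": {"name": schema_name}},
        },
    }
    return json.dumps(request)


# Hand-written stand-in for the schema a Pydantic model would generate.
qa_schema = {
    "type": "object",
    "properties": {
        "chain_of_thought": {"type": "string"},
        "question": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["chain_of_thought", "question", "answer"],
}

line = build_batch_line(
    custom_id="request-0",
    model="gpt-4o",
    messages=[{"role": "user", "content": "Here is the text chunk: ..."}],
    schema_name="QuestionAnswerPair",
    parameters=qa_schema,
)
print(line)
```

Writing one such line per passage yields a file in the shape the Batch API expects.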

Once we have this new .jsonl file, we can use the batch command of the instructor CLI to create a new batch job.

> % ls -a | grep test.jsonl
test.jsonl

> % instructor batch create-from-file --file-path test.jsonl

This displays a table like the one below. In my case, the batch job took about 6 minutes to complete and cost $2.72 to run.

Batch ID	Created At	Status	Failed	Completed	Total
batch_Z8XUudoweH43R9c4sr4wRYub	2024-07-16 12:45:22	in_progress	0	483	1627

Once the batch job finishes, its status changes to completed.
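
If you would rather wait for completion from a script than re-run the CLI, a polling loop along these lines may help. It is a minimal sketch, not part of instructor: `wait_for_batch` is a hypothetical helper, and it assumes a client exposing `client.batches.retrieve(batch_id)` as the OpenAI Python SDK does, with the returned object carrying a `status` attribute.

```python
import time

# Statuses after which a batch will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=30.0, sleep=time.sleep):
    """Poll until the batch reaches a terminal status, then return it.

    `sleep` is injectable so the loop can be exercised without waiting.
    """
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)


# Usage (requires network access and an API key):
#   from openai import OpenAI
#   batch = wait_for_batch(OpenAI(), "<batch id here>")
#   print(batch.status)
```

Checking for every terminal status, not only completed, keeps the loop from spinning forever when a job fails, expires, or is cancelled.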

Cancelling a Job

If you want to cancel a batch job partway through, you can do that with the instructor batch CLI as well.

instructor batch cancel --batch-id <batch id here>

We can then download the file generated by the batch job with a CLI command.

instructor batch download-file --download-file-path output.jsonl --batch-id batch_Z8XUudoweH43R9c4sr4wRYub

This creates a .jsonl file with the generated content at the path you specify.
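
Before parsing, it can be useful to check the downloaded file for failed requests. The sketch below assumes each output line is a JSON object with `custom_id`, `response` (holding a `status_code`), and `error` keys, which is the general shape of Batch API output. `summarize_output` and the two sample lines are made up for illustration.

```python
import json
from collections import Counter


def summarize_output(lines):
    """Count responses per HTTP status code and collect failing custom_ids."""
    status_counts = Counter()
    failed_ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        status = response.get("status_code")
        status_counts[status] += 1
        if record.get("error") is not None or status != 200:
            failed_ids.append(record.get("custom_id"))
    return status_counts, failed_ids


# Two fabricated sample lines standing in for a real output.jsonl.
sample = [
    json.dumps({"custom_id": "request-0", "response": {"status_code": 200, "body": {}}, "error": None}),
    json.dumps({"custom_id": "request-1", "response": {"status_code": 400, "body": {}}, "error": None}),
]
counts, failed = summarize_output(sample)
print(dict(counts), failed)

# With a real file: summarize_output(open("output.jsonl", encoding="utf-8"))
```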

Parsing the Generated Response

We can then parse the generated responses with the .parse_from_file method provided by the BatchJob class.

from instructor.batch import BatchJob
from pydantic import BaseModel, Field



class QuestionAnswerPair(BaseModel):
    """
    This model represents a pair of a question generated from a text chunk, its corresponding answer,
    and the chain of thought leading to the answer. The chain of thought provides insight into how the answer
    was derived from the question.
    """

    chain_of_thought: str = Field(
        description="The reasoning process leading to the answer."
    )
    question: str = Field(description="The generated question from the text chunk.")
    answer: str = Field(description="The answer to the generated question.")


parsed, unparsed = BatchJob.parse_from_file(  # (1)!
    file_path="./output.jsonl", response_model=QuestionAnswerPair
)

print(len(parsed))
#> 0
print(len(unparsed))
#> 0

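(1) We can then use a generic Pydantic schema to parse the generated function calls back into objects.

This returns two lists:

parsed is a list of the responses that were successfully parsed into the QuestionAnswerPair BaseModel class
unparsed is a second list of the responses that could not be parsed into the QuestionAnswerPair BaseModel class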