Smarter Summaries with Fine-Tuned GPT-3.5 and Chain of Density

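Learn how to use Instructor to distil an iterative method like Chain of Density into a single fine-tuned model.

In this article, we walk you through implementing the original Chain of Density method with Instructor, then show how to distil a GPT-3.5 model to match GPT-4's iterative summarization capabilities. Using these methods, we were able to cut latency by 20x and cost by 50x while maintaining entity density.

By the end, you will have a GPT-3.5 model (fine-tuned using Instructor's tooling) capable of producing summaries that rival the effectiveness of Chain of Density [Adams et al. (2023)]. As always, all code is available for reference in the examples/chain-of-density folder of our repository.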

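Datasets and Colab Notebook

We have also uploaded all of our generated data to Hugging Face, in case you want to try reproducing these experiments. We have also added a Colab instance so you can inspect the generated values.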

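Part 1) Chain of Density

Summarizing large amounts of text with AI can be challenging and often relies on inconsistent techniques. Chain of Density prompting, the method introduced by Adams et al. (2023), improves AI-based text summarization and outperforms human-written summaries.

The AI first produces an initial summary, then refines it over several iterations. Each iteration adds article entities that were missing from the previous summary while keeping the length constant, which yields an entity-dense, informative summary called a Chain of Density.

The method was first introduced in the paper "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting". The team found that this method consistently beat similar summaries written by human annotators.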

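Implementation Details

Note that our implementation uses a validator, rather than the prompt, to ensure that rewritten summaries have a minimum length. We also perform only 3 rounds of rewriting instead of 5, so the final entity density is lower.

Original Prompt

We can break the original process down into smaller API calls. This allows us to introduce validation at each step to ensure that we get the results we want.

Original Chain of Density Prompt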

Article: {{ARTICLE}}

You will generate increasingly concise, entity-dense summaries of the
above Article.

Repeat the following 2 steps 5 times.

Step 1. Identify 1-3 informative Entities (";" delimited) from the
Article which are missing from the previously generated summary.
Step 2. Write a new, denser summary of identical length which covers
every entity and detail from the previous summary plus the Missing
Entities.

A Missing Entity is:
- Relevant: to the main story.
- Specific: descriptive yet concise (5 words or fewer).
- Novel: not in the previous summary.
- Faithful: present in the Article.
- Anywhere: located anywhere in the Article.

Guidelines:
- The first summary should be long (4-5 sentences, ~80 words) yet
highly non-specific, containing little information beyond the
entities marked as missing. Use overly verbose language and fillers
(e.g., "this article discusses") to reach ~80 words.
- Make every word count: re-write the previous summary to improve
flow and make space for additional entities.
- Make space with fusion, compression, and removal of uninformative
phrases like "the article discusses"
- The summaries should become highly dense and concise yet
self-contained, e.g., easily understood without the Article.
- Missing entities can appear anywhere in the new summary.
- Never drop entities from the previous summary. If space cannot be
made, add fewer new entities.

Remember, use the exact same number of words for each summary.

Answer in JSON. The JSON should be a list (length 5) of dictionaries
whose keys are "Missing_Entities" and "Denser_Summary"
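
To make the shape of this prompt's input and output concrete, here is a minimal sketch of filling in the `{{ARTICLE}}` placeholder and checking the JSON the prompt asks for. The helper names `build_cod_prompt` and `parse_cod_response` are hypothetical, and the template is abbreviated; this is not how our Instructor-based implementation works, since we split the process into separate validated calls.

```python
import json
from typing import Dict, List

# Abbreviated stand-in for the full prompt above; only the placeholder matters here.
COD_PROMPT_TEMPLATE = (
    "Article: {{ARTICLE}}\n\n"
    "You will generate increasingly concise, entity-dense summaries of the above Article.\n"
)


def build_cod_prompt(article: str, template: str = COD_PROMPT_TEMPLATE) -> str:
    """Substitute the article text for the {{ARTICLE}} placeholder."""
    return template.replace("{{ARTICLE}}", article)


def parse_cod_response(raw: str, expected_steps: int = 5) -> List[Dict[str, str]]:
    """Parse the JSON list the prompt asks for and check its shape."""
    steps = json.loads(raw)
    if not isinstance(steps, list) or len(steps) != expected_steps:
        raise ValueError(f"expected a list of {expected_steps} steps")
    for step in steps:
        # Each step must carry exactly the two keys named in the prompt.
        if not isinstance(step, dict) or set(step) != {"Missing_Entities", "Denser_Summary"}:
            raise ValueError(f"unexpected step: {step!r}")
    return steps
```

The final, densest summary would then be `parse_cod_response(raw)[-1]["Denser_Summary"]`.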

Data Modelling

Before we begin modelling the data, let's make sure we have all of our dependencies installed:

pip install instructor aiohttp rich

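Initial Summary

Let's start by walking through some of the data models that we'll be using as the response_model for our OpenAI function calls.

First, we need a data model for the initial summary that we will generate. We take the class description straight from the original prompt. It's important to note that these docstrings serve a purpose: they are used directly by the LLM when it generates the output.

A quick note on docstrings

Under the hood, Instructor parses the response_model that you give it into a function call for OpenAI to execute. This means that the final output will be closely linked to the Pydantic model you specify.

For instance, take this simple model that we use later on in fine-tuning.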

from pydantic import BaseModel, Field

class GeneratedSummary(BaseModel):
    """
    This represents a highly concise summary that includes as many entities as possible from the original source article.

    An Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.

    Guidelines
    - Make every word count
    - The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
    """

    summary: str = Field(
        ...,
        description="This represents the final summary generated that captures the meaning of the original article which is as concise as possible. ",
    )

We eventually transform it into an OpenAI function call as seen below.

{
"functions": [
    {
    "name": "GeneratedSummary",
    "description": "This represents a highly concise summary that includes as many entities as possible from the original source article.\n\nAn Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.\n\nGuidelines\n- Make every word count\n- The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.\n- Make space with fusion, compression, and removal of uninformative phrases like \"the article discusses\"",
    "parameters": {
        "type": "object",
        "properties": {
        "summary": {
            "description": "This represents the final summary generated that captures the meaning of the original article which is as concise as possible. ",
            "title": "Summary",
            "type": "string"
        }
        },
        "required": [
        "summary"
        ]
    }
    }
]
}

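Therefore, the more detailed and precise your descriptions, the better the outputs you will get. But that's not all: since it's Pydantic all the way down, you can validate and parse the generated output to make sure it is exactly what you specified. It's all Python from start to finish.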

class InitialSummary(BaseModel):
    """
    This is an initial summary which should be long ( 4-5 sentences, ~80 words)
    yet highly non-specific, containing little information beyond the entities marked as missing.
    Use overly verbose languages and fillers (Eg. This article discusses) to reach ~80 words.
    """

    summary: str = Field(
        ...,
        description="This is a summary of the article provided which is overly verbose and uses fillers. It should be roughly 80 words in length",
    )

Rewritten Summary

We'll also need one additional class to help model the rewritten schema.

class RewrittenSummary(BaseModel):
    """
    This is a new, denser summary of identical length which covers every entity
    and detail from the previous summary plus the Missing Entities.

    Guidelines
    - Make every word count : Rewrite the previous summary to improve flow and make space for additional entities
    - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
    - The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
    - Missing entities can appear anywhere in the new summary

    An Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.
    """

    summary: str = Field(
        ...,
        description="This is a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities. It should have the same length ( ~ 80 words ) as the previous summary and should be easily understood without the Article",
    )
    absent: List[str] = Field(
        default_factory=list,
        description="this is a list of Entities found absent from the new summary that were present in the previous summary",
    )
    missing: List[str] = Field(
        default_factory=list,
        description="This is a list of 1-3 informative Entities from the Article that are missing from the new summary which should be included in the next generated summary.",
    )

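Using Pydantic Validators with Instructor

For a more in-depth walkthrough of how to use Pydantic validators with the Instructor library, we recommend checking out our previous article on LLM validation, Good LLM Validation is Just Good Validation.

Ideally, we'd like missing to have a length between 1 and 3, absent to be an empty list, and our rewritten summaries to keep a minimum entity density. With Instructor, we can implement this logic using native Pydantic validators that are declared as part of the class itself.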

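from typing import List

import nltk
import spacy
from pydantic import field_validator

nlp = spacy.load("en_core_web_sm")

# The validators below are declared as methods of the RewrittenSummary class above.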

@field_validator("summary")
def min_length(cls, v: str):
    tokens = nltk.word_tokenize(v) #(1)!
    num_tokens = len(tokens)
    if num_tokens < 60:
        raise ValueError(
            "The current summary is too short. Please make sure that you generate a new summary that is around 80 words long."
        )
    return v

@field_validator("missing")
def has_missing_entities(cls, missing_entities: List[str]):
    if len(missing_entities) == 0:
        raise ValueError(
            "You must identify 1-3 informative Entities from the Article which are missing from the previously generated summary to be used in a new summary"
        )
    return missing_entities

@field_validator("absent")
def has_no_absent_entities(cls, absent_entities: List[str]):
    absent_entity_string = ",".join(absent_entities)
    if len(absent_entities) > 0:
        print(f"Detected absent entities of {absent_entity_string}")
        raise ValueError(
            f"Do not omit the following Entities {absent_entity_string} from the new summary"
        )
    return absent_entities

@field_validator("summary")
def min_entity_density(cls, v: str):
    tokens = nltk.word_tokenize(v)
    num_tokens = len(tokens)

    # Extract Entities
    doc = nlp(v) #(2)!
    num_entities = len(doc.ents)

    density = num_entities / num_tokens
    if density < 0.08: #(3)!
        raise ValueError(
            f"The summary of {v} has too few entities. Please regenerate a new summary with more new entities added to it. Remember that new entities can be added at any point of the summary."
        )

    return v

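1. Similar to the original paper, we use the NLTK word tokenizer to count the number of tokens in the generated summary. We aim for at least 60 tokens in the generated summary so that we don't lose information.
2. We also use the spaCy library to calculate the entity density of the generated summary.
3. We also enforce a minimum entity density so that we stay within a given range. 0.08 was chosen arbitrarily in this case.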
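
To get a feel for what the 0.08 threshold means in practice, here is a small sketch of the arithmetic behind the density check. It takes entity and token counts as plain numbers instead of calling spaCy and NLTK, and the helper names are our own for illustration.

```python
import math


def min_entities_needed(num_tokens: int, min_density: float = 0.08) -> int:
    """Smallest entity count whose density (entities / tokens) reaches min_density."""
    return math.ceil(num_tokens * min_density)


def entity_density(num_entities: int, num_tokens: int) -> float:
    """Entities per token, the same ratio the min_entity_density validator checks."""
    if num_tokens == 0:
        return 0.0
    return num_entities / num_tokens
```

Under these assumptions, an 80-token summary needs at least `min_entities_needed(80)`, which is 7, recognised entities to pass the validator, while a summary at the 60-token minimum needs 5.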

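Putting It All Together

Now that we have our models and the rough flow figured out, let's implement a function that summarizes a piece of text using the Chain of Density method.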

from openai import OpenAI
import instructor

client = instructor.from_openai(OpenAI()) #(1)!

def summarize_article(article: str, summary_steps: int = 3):
    summary_chain = []
    # We first generate an initial summary
    summary: InitialSummary = client.chat.completions.create(  # (2)!
        model="gpt-4-0613",
        response_model=InitialSummary,
        messages=[
            {
                "role": "system",
                "content": "Write a summary about the article that is long (4-5 sentences) yet highly non-specific. Use overly, verbose language and fillers(eg.,'this article discusses') to reach ~80 words",
            },
            {"role": "user", "content": f"Here is the Article: {article}"},
            {
                "role": "user",
                "content": "The generated summary should be about 80 words.",
            },
        ],
        max_retries=2,
    )
    prev_summary = None
    summary_chain.append(summary.summary)
    for i in range(summary_steps):
        missing_entity_message = (
            []
            if prev_summary is None
            else [
                {
                    "role": "user",
                    "content": f"Please include these Missing Entities: {','.join(prev_summary.missing)}",
                },
            ]
        )
        new_summary: RewrittenSummary = client.chat.completions.create( # (3)!
            model="gpt-4-0613",
            messages=[
                {
                    "role": "system",
                    "content": """
                You are going to generate an increasingly concise,entity-dense summary of the following article.

                Perform the following two tasks
                - Identify 1-3 informative entities from the following article which is missing from the previous summary
                - Write a new denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities

                Guidelines
                - Make every word count: re-write the previous summary to improve flow and make space for additional entities
                - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
                - The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
                - Missing entities can appear anywhere in the new summary
                - Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
                """,
                },
                {"role": "user", "content": f"Here is the Article: {article}"},
                {
                    "role": "user",
                    "content": f"Here is the previous summary: {summary_chain[-1]}",
                },
                *missing_entity_message,
            ],
            max_retries=3, #(4)!
            max_tokens=1000,
            response_model=RewrittenSummary,
        )
        summary_chain.append(new_summary.summary)
        prev_summary = new_summary

    return summary_chain

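1. We need to apply a patch function on the OpenAI client to get all the benefits that Instructor provides. With a simple patch, we get automatic type coercion of our outputs and automatic retries for invalid outputs out of the box!
2. We first generate an initial summary. Note that the system prompt explicitly asks for a summary of about 80 words that is lengthy and padded with overly verbose fillers.
3. We slightly modify the original system prompt used in the paper to perform the rewrite of the summary. Using Instructor, we also get validation of the generated output through the field_validators that we defined above.
4. If you've chosen a minimum entity density greater than 0.08, make sure to increase max_retries in case multiple rewrites are needed.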
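
The interplay between validators and retries can be sketched without any API calls. The following is only an illustration of the validate-and-re-ask idea, with a stand-in `generate` callable that we made up for this example. It is not Instructor's internal implementation, and the way attempts are counted here may differ from how max_retries is counted.

```python
from typing import Callable, Optional


def generate_with_retries(
    generate: Callable[[Optional[str]], str],
    validate: Callable[[str], None],
    max_attempts: int = 3,
) -> str:
    """Call `generate` until `validate` accepts the output or attempts run out.

    `generate` receives the previous validation error (None on the first call);
    `validate` raises ValueError when the output is unacceptable.
    """
    error: Optional[str] = None
    for _ in range(max_attempts):
        output = generate(error)
        try:
            validate(output)
            return output
        except ValueError as exc:
            error = str(exc)  # fed back so the next attempt can correct itself
    raise RuntimeError(f"validation still failing after {max_attempts} attempts: {error}")
```

A stricter validator (for example a higher minimum density) fails more often, which is why the attempt budget needs to grow with it.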

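This summarization function yields a result that triples the number of entities while keeping the same number of tokens. We can also see that, stylistically, the summary reads much more naturally.

First Iteration

This article discusses the highly anticipated boxing match between Manny Pacquiao and Floyd Mayweather. The article revolves around Manny Pacquiao's statements about his upcoming fight and his preparations for it. A portion of the article provides details about the financial stipulations of the match and its significance in the sporting arena. Quotes from Pacquiao illustrating his determination and his battle strategy are highlighted. The tone of the article is largely centered around creating a build-up to the upcoming mega event.

Final Iteration

Manny Pacquiao, the Filipino boxer, anticipates the forthcoming May 2 showdown at the MGM Grand as the fight of his life, against the undefeated American Floyd Mayweather, in a $300m bout. Despite being seen as the underdog in this high-stakes Las Vegas match, Pacquiao is confident, promising a warrior's spirit and assuring the fans who have been awaiting this encounter for a decade that it will indeed be the biggest sporting spectacle in history, worthy of their anticipation.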

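Part 2) Fine-Tuning

In this section, we'll look at how to fine-tune a GPT-3.5 model so that it performs on par with a GPT-4 model. We'll then compare our model's performance against GPT-4 to see how it stacks up.

Creating a Training Set

To prevent data contamination during testing, we randomly sampled 120 articles from the griffin/chain-of-density dataset, split them into train.csv and test.csv files, and uploaded those files to Hugging Face. Now, all we need to do is import the Instructions module from the Instructor package, which lets you generate a nicely formatted .jsonl file for fine-tuning.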

from typing import List
from chain_of_density import summarize_article #(1)!
import csv
import logging
import instructor
from pydantic import BaseModel, Field
from openai import OpenAI

client = instructor.from_openai(OpenAI()) # (2)!

logging.basicConfig(level=logging.INFO) #(3)!

instructions = instructor.Instructions( #(4)!
    name="Chain Of Density",
    finetune_format="messages",
    # log handler is used to save the data to a file
    # you can imagine saving it to a database or other storage
    # based on your needs!
    log_handlers=[logging.FileHandler("generated.jsonl")],
    openai_client=client,
)

class GeneratedSummary(BaseModel):
    """
    This represents a highly concise summary that includes as many entities as possible from the original source article.

    An Entity is a real-world object that's assigned a name - for example, a person, country a product or a book title.

    Guidelines
    - Make every word count
    - The new summary should be highly dense and concise yet self-contained, eg., easily understood without the Article.
    - Make space with fusion, compression, and removal of uninformative phrases like "the article discusses"
    """

    summary: str = Field(
        ...,
        description="This represents the final summary generated that captures the meaning of the original article which is as concise as possible. ",
    )

@instructions.distil #(5)!
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    return GeneratedSummary(summary=summary_chain[-1]) #(6)!

with open("train.csv", "r") as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header
    for article, summary in reader:
        # Run distillation to generate the values
        distil_summarization(article)

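1. In this example, we use the summarize_article function defined above. We saved it in a local file called chain_of_density.py, hence the import.
2. We patch the default OpenAI client so that we can use the Instructor library with it.
3. We also need to configure logging at the INFO level. This is very important: if it is not configured, no output will be generated.
4. We instantiate an Instructions object, which helps us convert our function calls into a valid .jsonl file. We also define the name of the .jsonl file in the log_handlers parameter.
5. We add the instructions.distil annotation so that we automatically capture the input and output of the function we'd like the fine-tuned model to reproduce.
6. We return a Pydantic object that matches the annotation on our function. Note that a function using the instructions.distil annotation must return a Pydantic object.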
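
The sampling and train/test split described at the start of this section can be sketched as follows. The split sizes and the seed are assumptions for illustration: we only state above that 120 articles were sampled, and the default of 20 test articles mirrors the 20 held-out articles used in our benchmarks.

```python
import random
from typing import List, Sequence, Tuple


def sample_and_split(
    articles: Sequence[str],
    sample_size: int = 120,
    test_size: int = 20,  # assumed split size, not stated in this article
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Randomly sample articles, then split the sample into disjoint train/test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sampled = rng.sample(list(articles), sample_size)
    return sampled[test_size:], sampled[:test_size]
```

Because both lists come from a single sample without replacement, no test article can leak into the training set.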

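Rate Limiting

We recommend first running this script on a small subset of the dataset to check that everything is configured correctly. Before running any subsequent commands, don't forget to add rate-limit error handling with tenacity and to set the OPENAI_API_KEY shell environment variable.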
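
If you'd like to see what such error handling does before reaching for tenacity, here is a standard-library sketch of retrying with exponential backoff. The function name and parameters are our own for illustration; in practice, `retry_on` would be the rate-limit exception raised by your client.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, sleeping base_delay * 2**attempt after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2**attempt)
    raise ValueError("max_attempts must be at least 1")
```

Wrapping each `distil_summarization(article)` call in a helper like this keeps a long-running generation script from dying on a transient rate-limit error.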

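Creating Fine-Tuning Jobs

Once we run this script, a new file called generated.jsonl will be created in our local repository. Now all that's left is to run the command below to start fine-tuning your first model!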

instructor jobs create-from-file generated.jsonl
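
Before submitting the file, it can be worth a quick sanity check that every line of generated.jsonl parses as JSON and that the number of examples is what you expect. The sketch below uses only the standard library; `load_jsonl` is a hypothetical helper, not part of Instructor.

```python
import json
from typing import List


def load_jsonl(path: str) -> List[dict]:
    """Parse every non-empty line of a .jsonl file, failing loudly on bad lines."""
    records = []
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return records
```

For example, `len(load_jsonl("generated.jsonl"))` should match the number of articles you ran through `distil_summarization`.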

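Finetuning Reference

Check out our Finetuning CLI to learn about other hyperparameters that you can tune to improve your model's performance.

Once the job is complete, all we need to do is change the annotation on the distil_summarization function in our original file above to start using our new model.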

@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch")  # (1)!
def distil_summarization(text: str) -> GeneratedSummary:
    summary_chain: List[str] = summarize_article(text)
    return GeneratedSummary(summary=summary_chain[-1])

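1. Don't forget to replace this with your new model ID. OpenAI identifies fine-tuned models with an ID of ft:gpt-3.5-turbo-0613:personal: under the Fine-tuning tab of its dashboard.

With that, you now have your own fine-tuned model ready to go and serve data in production. We've seen how Instructor can make your life easier, from fine-tuning to distillation.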

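Results and Benchmarks

We'll compare the following models in 3 ways, using 20 articles that were not used for fine-tuning.

- Entity Density: the number of entities per token; the higher, the better.
- Latency: the time to the last generated token, in seconds.
- Costs: the total cost of generating the outputs. We break the cost down into training and inference costs for easy reference.

- 3.5 Finetuned (n): a GPT-3.5 model that we fine-tuned on n examples. Each model was fine-tuned for 4-5 epochs (this was decided automatically by the OpenAI scheduler).
- GPT-4 (COD): a GPT-4 model to which we applied 3 rounds of Chain of Density rewrites to generate a summary, using the method above.
- GPT-3.5 (Vanilla): a GPT-3.5 model that we asked to generate entity-dense, concise summaries. Summaries were generated in a single pass, targeting about 80-90 tokens.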

Model	Mean Latency (s)	Mean Entity Density
3.5 Finetuned (20)	2.1	0.15
3.5 Finetuned (50)	2.1	0.14
3.5 Finetuned (76)	2.1	0.14
GPT-3.5 (Vanilla)	16.8	0.12
GPT-4 (COD)	49.5	0.15
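
The latency figures in the table translate into speedups as follows. This is just the arithmetic on the reported means, with GPT-4 (COD) as the baseline.

```python
# Mean latency in seconds, copied from the table above.
latency_seconds = {
    "3.5 Finetuned (20)": 2.1,
    "3.5 Finetuned (50)": 2.1,
    "3.5 Finetuned (76)": 2.1,
    "GPT-3.5 (Vanilla)": 16.8,
    "GPT-4 (COD)": 49.5,
}

baseline = latency_seconds["GPT-4 (COD)"]

# Speedup of each model relative to the GPT-4 Chain of Density pipeline.
speedups = {
    model: round(baseline / latency, 1)
    for model, latency in latency_seconds.items()
}

for model, speedup in speedups.items():
    print(f"{model}: {speedup}x")
```

The fine-tuned models come out at roughly 23.6x faster than GPT-4 (COD), in line with the 20x latency reduction quoted in the introduction.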

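Finetuning Datasets

For our fine-tuned models, we made a few optimisations to improve performance.

We only included summaries with a minimum density of 0.15 in the dataset, took the summary with the highest density in the entire chain as the final summary, forced every regenerated summary to have a minimum density of 0.12, and regenerated summaries up to three times if they didn't meet the requirements. This is a much more expensive strategy and can cost 2.5x or more what the approach in this tutorial does.

Due to these stringent requirements, generating 75 examples cost a total of 63.46 USD, which works out to about 0.85 USD per generated summary example.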
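
The "densest summary in the chain" selection described above can be sketched like this. The `density_fn` argument stands in for whatever entity-density measure you use (ours is spaCy-based), and `pick_training_summary` is a hypothetical helper written for this illustration.

```python
from typing import Callable, List, Optional


def pick_training_summary(
    summary_chain: List[str],
    density_fn: Callable[[str], float],
    min_density: float = 0.15,
) -> Optional[str]:
    """Return the densest summary in the chain, or None if it is below min_density."""
    if not summary_chain:
        return None
    best = max(summary_chain, key=density_fn)
    return best if density_fn(best) >= min_density else None
```

Articles for which this returns None would simply be left out of the fine-tuning dataset.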

Using the OpenAI Usage Dashboard, we can calculate the cost of generating 20 summaries as seen below.

Model	Training Cost ($)	Inference Cost ($)	Tokens Used	Total Cost ($)
GPT-3.5 (Vanilla)	-	0.20	51,162	0.2
3.5 Finetuned (20)	0.7	0.20	56,573	0.8
3.5 Finetuned (50)	1.4	0.17	49,057	1.3
3.5 Finetuned (76)	1.8	0.17	51,583	2.5
GPT-4 (COD)	-	12.9	409,062	12.9

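Here, we can see that GPT-4 has an approximate inference cost of 0.65 USD per summary, while our fine-tuned model has an inference cost of 0.0091 USD per summary, which is about 72x cheaper.

Interestingly, the model fine-tuned with the fewest examples seems to outperform the others. While the reason for this is unknown, a few potential explanations are that we didn't train for enough epochs (we used the default of 5), or that the models began learning to imitate other behaviour, such as a more abstract writing style from the larger variety of samples, resulting in a drop in entity density.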

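Conclusions

Fine-tuning this iterative method made it 20-40x faster while improving overall performance, yielding massive efficiency gains by fine-tuning and distilling capabilities into a specialized model.

We've seen how Instructor can make your life easier, from data modelling to distillation and fine-tuning. If you enjoyed this content or want to try out instructor, check out the GitHub repository and don't forget to give us a star!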