Building a Pairwise LLM Judge with Instructor and Pydantic
In this blog post, we'll explore how to create a pairwise LLM judge using Instructor and Pydantic. This judge evaluates the relevance between a question and a text snippet, demonstrating a practical application of structured outputs in language model interactions.
Introduction
Evaluating text relevance is a common task in natural language processing and information retrieval. By combining large language models (LLMs) with structured outputs, we can build a system that judges the similarity or relevance between a question and a given piece of text.
Setting Up the Environment
First, let's set up our environment with the necessary imports.
Here, we use the instructor library, which integrates seamlessly with OpenAI's API and Pydantic for generating structured outputs.
Defining the Judgment Model
We'll use Pydantic to define a Judgment model that standardizes the structure of our LLM's output:
class Judgment(BaseModel):
    thought: str = Field(
        description="The step-by-step reasoning process used to analyze the question and text"
    )
    justification: str = Field(
        description="Explanation for the similarity judgment, detailing key factors that led to the conclusion"
    )
    similarity: bool = Field(
        description="Boolean judgment indicating whether the question and text are similar or relevant (True) or not (False)"
    )
This model ensures that our LLM's output is structured and includes a reasoning process, a justification, and a boolean similarity judgment.
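As a quick illustration (the field values here are made up, and the model is re-declared so the snippet is self-contained), the Judgment model validates like any other Pydantic model and rejects a non-boolean similarity value:

```python
from pydantic import BaseModel, Field, ValidationError

class Judgment(BaseModel):
    thought: str = Field(description="Step-by-step reasoning")
    justification: str = Field(description="Explanation for the judgment")
    similarity: bool = Field(description="Whether the question and text are relevant")

judgment = Judgment(
    thought="The question asks about climate change; the text discusses greenhouse gas emissions.",
    justification="Both cover the causes of global warming.",
    similarity=True,
)
print(judgment.similarity)  # True

# Pydantic enforces the declared types at construction time.
try:
    Judgment(thought="t", justification="j", similarity="maybe")
except ValidationError:
    print("non-boolean similarity rejected")
```

This type enforcement is what makes the judge's output safe to use programmatically: downstream code can branch on `judgment.similarity` without parsing free-form text.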
Creating the Judgment Function
Next, we'll create a function that uses our LLM to judge the relevance between a question and a text:
def judge_relevance(question: str, text: str) -> Judgment:
    return client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": """
                    You are tasked with comparing a question and a piece of text to determine if they are relevant to each other or similar in some way. Your goal is to analyze the content, context, and potential connections between the two.

                    To determine if the question and text are relevant or similar, please follow these steps:

                    1. Carefully read and understand both the question and the text.
                    2. Identify the main topic, keywords, and concepts in the question.
                    3. Analyze the text for any mention of these topics, keywords, or concepts.
                    4. Consider any potential indirect connections or implications that might link the question and text.
                    5. Evaluate the overall context and purpose of both the question and the text.

                    As you go through this process, please use a chain of thought approach. Write out your reasoning for each step inside <thought> tags.

                    After your analysis, provide a boolean judgment on whether the question and text are similar or relevant to each other. Use "true" if they are similar or relevant, and "false" if they are not.

                    Before giving your final judgment, provide a justification for your decision. Explain the key factors that led to your conclusion.

                    Please ensure your analysis is thorough, impartial, and based on the content provided.
                """,
            },
            {
                "role": "user",
                "content": """
                    Here is the question:
                    <question>
                    {{question}}
                    </question>

                    Here is the text:
                    <text>
                    {{text}}
                    </text>
                """,
            },
        ],
        response_model=Judgment,
        context={"question": question, "text": text},
    )
This function takes a question and a piece of text as input, sends them to the LLM along with a predefined prompt, and returns a structured Judgment object.
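Note that `{{question}}` and `{{text}}` are not Python f-string placeholders: Instructor renders them from the `context` dict using Jinja templating. Conceptually, the rendering works like this simplified sketch that uses jinja2 directly (not Instructor's internal code):

```python
from jinja2 import Template

user_template = """
Here is the question:
<question>
{{question}}
</question>
Here is the text:
<text>
{{text}}
</text>
"""

# Instructor performs an equivalent substitution with the values
# passed via the `context` parameter of `create`.
rendered = Template(user_template).render(
    question="What causes ocean tides?",
    text="Tides result mainly from the Moon's gravitational pull.",
)
print(rendered)
```

Passing the values through `context` rather than interpolating them yourself keeps the prompt template static and makes the variables explicit.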
Testing the Judge
To test our pairwise LLM judge, we can create a set of test pairs and evaluate its performance:
if __name__ == "__main__":
    test_pairs = [
        {
            "question": "What are the main causes of climate change?",
            "text": "Global warming is primarily caused by human activities, such as burning fossil fuels, deforestation, and industrial processes. These activities release greenhouse gases into the atmosphere, trapping heat and leading to a rise in global temperatures.",
            "is_similar": True,
        },
        # ... (other test pairs)
    ]

    score = 0
    for pair in test_pairs:
        result = judge_relevance(pair["question"], pair["text"])
        if result.similarity == pair["is_similar"]:
            score += 1

    print(f"Score: {score}/{len(test_pairs)}")
    #> Score: 9/10
This test loop runs the judge on each pair and compares the result against the expected similarity value, producing an overall score.
Conclusion
By combining Instructor, Pydantic, and OpenAI's language models, we've created a powerful tool for evaluating text relevance. This approach demonstrates the flexibility and power of structured outputs in LLM applications.
The pairwise LLM judge we've built can be used in a variety of scenarios, such as:
- Improving search relevance in information retrieval systems
- Evaluating the quality of question-answering systems
- Assisting content recommendation algorithms
- Automating parts of a content moderation pipeline
As you explore this technique, consider how you might extend or adapt it for your specific use cases. The combination of structured outputs and large language models opens up endless possibilities for building intelligent, interpretable AI systems.