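Verifying LLM Citations with Pydantic

Ensuring that information is accurate is essential. This article explores how Pydantic's powerful and flexible validators can improve data accuracy through citation verification.

We will start by verifying citations with a simple substring check. Then we will use instructor itself to drive an LLM that verifies citations and aligns answers with the given citations. Finally, we will explore how these techniques can be used to generate a dataset of accurate responses.

Example 1: Simple Substring Check

In this example, we use the Statements class to verify that a given substring quote exists in the text chunks. If the substring is not found, an error is raised.

Code Example: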

from typing import List
from openai import OpenAI
from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator
import instructor

client = instructor.from_openai(OpenAI())


class Statements(BaseModel):
    body: str
    substring_quote: str

    @field_validator("substring_quote")
    @classmethod
    def substring_quote_exists(cls, v: str, info: ValidationInfo):
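        # The context is supplied via model_validate(..., context=...);
        # fall back to an empty dict when no context was passed.
        context = (info.context or {}).get("text_chunks", {})

        for text_chunk in context.values():
            if v in text_chunk:
                return v
        raise ValueError(f"Could not find substring_quote `{v}` in contexts")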


class AnswerWithCitaton(BaseModel):
    question: str
    answer: List[Statements]

Although this example uses a simple substring check, more sophisticated techniques such as regular expressions or Levenshtein distance could be used instead.
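To make the Levenshtein option concrete, here is a minimal sketch of a fuzzy quote check using only the standard library. The helper names `levenshtein` and `find_fuzzy_quote` and the `max_distance` threshold are illustrative assumptions, not part of instructor or Pydantic; the matching logic could be called from inside a validator like `substring_quote_exists`.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def find_fuzzy_quote(
    quote: str, text_chunks: Dict[int, str], max_distance: int = 2
) -> Optional[int]:
    """Return the id of the first chunk containing a near match, else None.

    Slides a window of len(quote) characters over each chunk and accepts
    the chunk if any window is within max_distance edits of the quote.
    """
    width = len(quote)
    for chunk_id, chunk in text_chunks.items():
        # A chunk shorter than the quote is compared as a whole.
        last_start = max(len(chunk) - width, 0)
        for start in range(last_start + 1):
            if levenshtein(quote, chunk[start : start + width]) <= max_distance:
                return chunk_id
    return None


chunks = {
    1: "Jason is a pirate",
    2: "Paris is the capital of France",
    3: "Irrelevant data",
}
# "captial" swaps two letters, which plain Levenshtein counts as two edits.
print(find_fuzzy_quote("Paris is the captial of France", chunks))
print(find_fuzzy_quote("Berlin is the capital of Germany", chunks))
```

The sliding window is quadratic in the worst case, so for long chunks a dedicated fuzzy-matching library would likely be a better fit; the threshold of 2 is arbitrary and would need tuning for real data.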

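With the Statements and AnswerWithCitaton classes defined, we can validate an answer against the context; an error is raised when the quote is not found.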

try:
    AnswerWithCitaton.model_validate(
        {
            "question": "What is the capital of France?",
            "answer": [
                {"body": "Paris", "substring_quote": "Paris is the capital of France"},
            ],
        },
        context={
            "text_chunks": {
                1: "Jason is a pirate",
                2: "Paris is not the capital of France",
                3: "Irrelevant data",
            }
        },
    )
except ValidationError as e:
    print(e)

Example error message:

answer.0.substring_quote
  Value error, Could not find substring_quote `Paris is the capital of France` in contexts [type=value_error, input_value='Paris is the capital of France', input_type=str]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error

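Pydantic raises a validation error when substring_quote does not appear in any of the context's text chunks. The same approach extends to more complex checks, using techniques such as regular expressions or Levenshtein distance.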
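As one illustration of the regular-expression option, the sketch below matches a quote while ignoring differences in letter case and whitespace. The helper names `quote_pattern` and `find_quote` are hypothetical and exist only for this example; only the standard `re` module is used.

```python
import re
from typing import Dict, Optional


def quote_pattern(quote: str) -> "re.Pattern[str]":
    """Build a regex matching the quote with flexible whitespace and case."""
    # Escape each word so punctuation in the quote is matched literally,
    # then allow any run of whitespace between consecutive words.
    words = [re.escape(word) for word in quote.split()]
    return re.compile(r"\s+".join(words), flags=re.IGNORECASE)


def find_quote(quote: str, text_chunks: Dict[int, str]) -> Optional[int]:
    """Return the id of the first chunk matching the quote, else None."""
    if not quote.split():
        # An empty quote would match everything, so reject it outright.
        return None
    pattern = quote_pattern(quote)
    for chunk_id, chunk in text_chunks.items():
        if pattern.search(chunk):
            return chunk_id
    return None


wrapped_chunks = {
    1: "Jason is a pirate",
    2: "Paris is  the\ncapital of France.",
    3: "Irrelevant data",
}
print(find_quote("paris is the capital of france", wrapped_chunks))
print(find_quote("Paris is not the capital", wrapped_chunks))
```

Because the words must still appear in order with nothing but whitespace between them, a quote such as "Paris is the capital of France" continues to fail against "Paris is not the capital of France", as in the example above.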

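Example 2: Using an LLM for Verification

This approach uses an OpenAI LLM to verify the citation. If the citation does not exist in the context, the LLM returns an error message.

Code Example: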

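# Builds on the imports and `client` from Example 1.
from typing import Optional

from pydantic import Field, model_validator


class Validation(BaseModel):
    is_valid: bool
    error_messages: Optional[str] = Field(None, description="Error messages if any")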


class Statements(BaseModel):
    body: str
    substring_quote: str

    @model_validator(mode="after")
    def substring_quote_exists(self, info: ValidationInfo):
        # Fall back to an empty dict when no validation context was passed.
        context = (info.context or {}).get("text_chunks", {})

        resp: Validation = client.chat.completions.create(
            response_model=Validation,
            messages=[
                {
                    "role": "user",
                    "content": f"Does the following citation exist in the following context?\n\nCitation: {self.substring_quote}\n\nContext: {context}",
                }
            ],
            model="gpt-3.5-turbo",
        )

        if resp.is_valid:
            return self

        raise ValueError(resp.error_messages)


class AnswerWithCitaton(BaseModel):
    question: str
    answer: List[Statements]

Now, when the citation is correct, the LLM returns a valid response.

resp = AnswerWithCitaton.model_validate(
    {
        "question": "What is the capital of France?",
        "answer": [
            {"body": "Paris", "substring_quote": "Paris is the capital of France"},
        ],
    },
    context={
        "text_chunks": {
            1: "Jason is a pirate",
            2: "Paris is the capital of France",
            3: "Irrelevant data",
        }
    },
)
print(resp.model_dump_json(indent=2))

Result:

{
  "question": "What is the capital of France?",
  "answer": [
    {
      "body": "Paris",
      "substring_quote": "Paris is the capital of France"
    }
  ]
}

When the citation does not exist in the context, the LLM returns an error message.

try:
    AnswerWithCitaton.model_validate(
        {
            "question": "What is the capital of France?",
            "answer": [
                {"body": "Paris", "substring_quote": "Paris is the capital of France"},
            ],
        },
        context={
            "text_chunks": {
                1: "Jason is a pirate",
                2: "Paris is not the capital of France",
                3: "Irrelevant data",
            }
        },
    )
except ValidationError as e:
    print(e)

Example error message:

1 validation error for AnswerWithCitaton
answer.0
  Value error, Citation not found in context [type=value_error, input_value={'body': 'Paris', 'substr... the capital of France'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error
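Each Statements validation in this example costs one LLM call. One possible way to reduce that cost, sketched here under assumptions, is to try the exact substring check from Example 1 first and consult the LLM only when it fails. The `verify_quote` helper and the `llm_check` callable are hypothetical; `llm_check` stands in for the `client.chat.completions.create` call and is replaced by a fake so the sketch runs offline.

```python
from typing import Callable, Dict, List, Optional


def verify_quote(
    quote: str,
    text_chunks: Dict[int, str],
    llm_check: Callable[[str, Dict[int, str]], Optional[str]],
) -> Optional[str]:
    """Return None if the quote is supported, else an error message.

    llm_check is a hypothetical callable that returns None when the LLM
    accepts the citation and an error message otherwise.
    """
    # Cheap path: an exact substring match needs no LLM call.
    if any(quote in chunk for chunk in text_chunks.values()):
        return None
    # Expensive path: defer to the LLM only for inexact quotes.
    return llm_check(quote, text_chunks)


# Stand-in for the real LLM call; it records every quote it is asked about.
calls: List[str] = []


def fake_llm_check(quote: str, text_chunks: Dict[int, str]) -> Optional[str]:
    calls.append(quote)
    return "Citation not found in context"


context_chunks = {
    1: "Jason is a pirate",
    2: "Paris is the capital of France",
    3: "Irrelevant data",
}
print(verify_quote("Paris is the capital of France", context_chunks, fake_llm_check))
print(verify_quote("Paris is lovely", context_chunks, fake_llm_check))
print(calls)
```

Inside a validator, a non-None return value would be raised as a `ValueError`, mirroring how `resp.error_messages` is raised in the code above.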

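Example 3: Aligning Citations and Answers

In this example, we ensure that the provided answer is aligned with the given citations and context. An LLM is used to verify the alignment.

We use the same Statements model as above, but add a new model for the answer that also verifies the alignment of the citations.

Code Example: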

class AnswerWithCitaton(BaseModel):
    question: str
    answer: List[Statements]

    @model_validator(mode="after")
    def validate_answer(self, info: ValidationInfo):
        # Fall back to an empty dict when no validation context was passed.
        context = (info.context or {}).get("text_chunks", {})

        resp: Validation = client.chat.completions.create(
            response_model=Validation,
            messages=[
                {
                    "role": "user",
                    "content": f"Do the following answers match the question and the context?\n\nQuestion: {self.question}\n\nAnswer: {self.answer}\n\nContext: {context}",
                }
            ],
            model="gpt-3.5-turbo",
        )

        if resp.is_valid:
            return self

        raise ValueError(resp.error_messages)

When the answer and the citation do not match, the LLM returns an error message.

try:
    AnswerWithCitaton.model_validate(
        {
            "question": "What is the capital of France?",
            "answer": [
                {"body": "Texas", "substring_quote": "Paris is the capital of France"},
            ],
        },
        context={
            "text_chunks": {
                1: "Jason is a pirate",
                2: "Paris is the capital of France",
                3: "Irrelevant data",
            }
        },
    )
except ValidationError as e:
    print(e)

Example error message:

1 validation error for AnswerWithCitaton
  Value error, The answer does not match the question and context [type=value_error, input_value={'question': 'What is the...he capital of France'}]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error

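Conclusion

These examples demonstrate the potential of using Pydantic and OpenAI to improve data accuracy through citation verification. Although the LLM-based approach may not be efficient for runtime operations, it has exciting implications for generating datasets of accurate responses. By applying this approach during data generation, we can fine-tune a model that excels at citation accuracy, similar to our previous article on fine-tuning a better summarizer.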
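The dataset idea can be sketched as a filter: run each candidate answer through a validation function and keep only those that pass, written out as JSON Lines. This is a rough sketch, not a prescribed pipeline. The `write_validated` helper and the JSONL layout are assumptions, and `quotes_exist` is a standard-library stand-in; in practice the check would more likely wrap `AnswerWithCitaton.model_validate(..., context=...)` in a `try`/`except ValidationError`.

```python
import io
import json
from typing import Callable, Iterable, TextIO


def write_validated(
    candidates: Iterable[dict],
    is_valid: Callable[[dict], bool],
    out: TextIO,
) -> int:
    """Write each candidate that passes validation as one JSON line."""
    kept = 0
    for candidate in candidates:
        if is_valid(candidate):
            out.write(json.dumps(candidate) + "\n")
            kept += 1
    return kept


source_chunks = {
    1: "Jason is a pirate",
    2: "Paris is the capital of France",
    3: "Irrelevant data",
}


def quotes_exist(candidate: dict) -> bool:
    """Stand-in check: every quote must appear verbatim in some chunk."""
    return all(
        any(statement["substring_quote"] in chunk for chunk in source_chunks.values())
        for statement in candidate["answer"]
    )


candidates = [
    {
        "question": "What is the capital of France?",
        "answer": [
            {"body": "Paris", "substring_quote": "Paris is the capital of France"}
        ],
    },
    {
        "question": "What is the capital of France?",
        "answer": [
            {"body": "Lyon", "substring_quote": "Lyon is the capital of France"}
        ],
    },
]

buffer = io.StringIO()  # a real run would open a .jsonl file instead
kept = write_validated(candidates, quotes_exist, buffer)
print(kept)
print(buffer.getvalue())
```

Only the first candidate survives, because its quote appears verbatim in the source chunks; the surviving lines would form the fine-tuning dataset.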

If you enjoyed this content, check out our GitHub, give us a star, and take a look at the library.