Validating LLM Citations with Pydantic¶
Ensuring the accuracy of information is crucial. This article explores how Pydantic's powerful and flexible validators can improve data accuracy through citation verification.
We'll start by validating citations with a simple substring check. Then we'll use instructor
itself to drive an LLM that verifies citations and aligns answers with the given citations. Finally, we'll look at how these techniques can be used to generate datasets of accurate responses.
Example 1: Simple Substring Check¶
In this example, we use the `Statements`
class to verify that a given substring quote exists in the text chunks. If the substring is not found, an error is raised.
Code example:¶
```python
from typing import List

from openai import OpenAI
from pydantic import BaseModel, ValidationInfo, field_validator

import instructor

client = instructor.from_openai(OpenAI())


class Statements(BaseModel):
    body: str
    substring_quote: str

    @field_validator("substring_quote")
    @classmethod
    def substring_quote_exists(cls, v: str, info: ValidationInfo):
        context = info.context.get("text_chunks", None)
        for text_chunk in context.values():
            if v in text_chunk:  # (1)
                return v
        raise ValueError(f"Could not find substring_quote `{v}` in contexts")


class AnswerWithCitaton(BaseModel):
    question: str
    answer: List[Statements]
```
- While this example uses a simple substring check, more sophisticated techniques such as regular expressions or Levenshtein distance could be used instead.
Once the class is defined, we can use it to validate against the context and raise an error when the substring is not found.
```python
from pydantic import ValidationError

try:
    AnswerWithCitaton.model_validate(
        {
            "question": "What is the capital of France?",
            "answer": [
                {"body": "Paris", "substring_quote": "Paris is the capital of France"},
            ],
        },
        context={
            "text_chunks": {
                1: "Jason is a pirate",
                2: "Paris is not the capital of France",
                3: "Irrelevant data",
            }
        },
    )
except ValidationError as e:
    print(e)
```
Example error message:¶
```
answer.0.substring_quote
  Value error, Could not find substring_quote `Paris is the capital of France` in contexts [type=value_error, input_value='Paris is the capital of France', input_type=str]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error
```
When the `substring_quote` value does not appear in any of the context chunks, Pydantic raises a validation error. The same approach can validate more complex data using techniques such as regular expressions or Levenshtein distance.
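As a sketch of one such fuzzy alternative (not part of the original example), `difflib.SequenceMatcher` from the standard library can tolerate small transcription differences by comparing the quote against same-length windows of each chunk; the helper name and threshold below are assumptions:

```python
from difflib import SequenceMatcher


def quote_matches(quote: str, text_chunks: dict, threshold: float = 0.9) -> bool:
    """Return True if `quote` approximately matches a span in some chunk.

    Slides a window the length of the quote over each chunk and accepts
    if any window is at least `threshold` similar to the quote.
    """
    window = len(quote)
    for chunk in text_chunks.values():
        for start in range(max(1, len(chunk) - window + 1)):
            ratio = SequenceMatcher(None, quote, chunk[start:start + window]).ratio()
            if ratio >= threshold:
                return True
    return False


chunks = {1: "Jason is a pirate", 2: "Paris is the capital of France"}
print(quote_matches("Paris is the capitol of France", chunks))  # tolerates the one-character typo
print(quote_matches("Berlin is the capital", chunks))
```

A validator could call a helper like this in place of the exact `in` check; the threshold trades citation strictness against tolerance for minor OCR or transcription noise.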
Example 2: Validation Using an LLM¶
This approach uses an OpenAI LLM to verify the citation. If the citation does not exist in the context, the LLM returns an error message.
Code example:¶
```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationInfo, model_validator


class Validation(BaseModel):
    is_valid: bool
    error_messages: Optional[str] = Field(None, description="Error messages if any")


class Statements(BaseModel):
    body: str
    substring_quote: str

    @model_validator(mode="after")
    def substring_quote_exists(self, info: ValidationInfo):
        context = info.context.get("text_chunks", None)
        resp: Validation = client.chat.completions.create(
            response_model=Validation,
            messages=[
                {
                    "role": "user",
                    "content": f"Does the following citation exist in the following context?\n\nCitation: {self.substring_quote}\n\nContext: {context}",
                }
            ],
            model="gpt-3.5-turbo",
        )
        if resp.is_valid:
            return self
        raise ValueError(resp.error_messages)


class AnswerWithCitaton(BaseModel):
    question: str
    answer: List[Statements]
```
Now, when we supply a correct citation, the LLM returns a valid response.
```python
resp = AnswerWithCitaton.model_validate(
    {
        "question": "What is the capital of France?",
        "answer": [
            {"body": "Paris", "substring_quote": "Paris is the capital of France"},
        ],
    },
    context={
        "text_chunks": {
            1: "Jason is a pirate",
            2: "Paris is the capital of France",
            3: "Irrelevant data",
        }
    },
)
print(resp.model_dump_json(indent=2))
```
Result:¶
```json
{
  "question": "What is the capital of France?",
  "answer": [
    {
      "body": "Paris",
      "substring_quote": "Paris is the capital of France"
    }
  ]
}
```
When the citation does not exist in the context, the LLM returns an error message.
```python
try:
    AnswerWithCitaton.model_validate(
        {
            "question": "What is the capital of France?",
            "answer": [
                {"body": "Paris", "substring_quote": "Paris is the capital of France"},
            ],
        },
        context={
            "text_chunks": {
                1: "Jason is a pirate",
                2: "Paris is not the capital of France",
                3: "Irrelevant data",
            }
        },
    )
except ValidationError as e:
    print(e)
```
Example error message:¶
```
1 validation error for AnswerWithCitaton
answer.0
  Value error, Citation not found in context [type=value_error, input_value={'body': 'Paris', 'substr... the capital of France'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error
```
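Note that this validator issues one LLM call per `Statements` instance, so validating many answers repeats work whenever the same citation and context recur. A minimal mitigation sketch (an assumption, not something the original code does) is to memoize results per (citation, context) pair; `check` below stands in for the LLM-backed call:

```python
# Hypothetical cache around an expensive validity check.
_validation_cache: dict = {}


def cached_is_valid(citation: str, context: str, check) -> bool:
    """Call `check(citation, context)` at most once per distinct pair.

    `check` is any expensive boolean validator, e.g. a wrapper around the
    client.chat.completions.create call in the validator above.
    """
    key = (citation, context)
    if key not in _validation_cache:
        _validation_cache[key] = check(citation, context)  # runs once per key
    return _validation_cache[key]
```

This keeps repeated validation of the same record cheap, at the cost of never re-checking a pair within a run.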
Example 3: Aligning Citations and Answers¶
In this example, we make sure that the answer provided is aligned with the given citations and context, using an LLM to verify the alignment.
We reuse the same `Statements`
model as above, but add a new model for the answer that also validates the alignment of the citations.
Code example:¶
```python
class AnswerWithCitaton(BaseModel):
    question: str
    answer: List[Statements]

    @model_validator(mode="after")
    def validate_answer(self, info: ValidationInfo):
        context = info.context.get("text_chunks", None)
        resp: Validation = client.chat.completions.create(
            response_model=Validation,
            messages=[
                {
                    "role": "user",
                    "content": f"Do the following answers match the question and the context?\n\nQuestion: {self.question}\n\nAnswer: {self.answer}\n\nContext: {context}",
                }
            ],
            model="gpt-3.5-turbo",
        )
        if resp.is_valid:
            return self
        raise ValueError(resp.error_messages)
```
When there is a mismatch between the answer and the citations, the LLM returns an error message.
```python
try:
    AnswerWithCitaton.model_validate(
        {
            "question": "What is the capital of France?",
            "answer": [
                {"body": "Texas", "substring_quote": "Paris is the capital of France"},
            ],
        },
        context={
            "text_chunks": {
                1: "Jason is a pirate",
                2: "Paris is the capital of France",
                3: "Irrelevant data",
            }
        },
    )
except ValidationError as e:
    print(e)
```
Example error message:¶
```
1 validation error for AnswerWithCitaton
  Value error, The answer does not match the question and context [type=value_error, input_value={'question': 'What is the...he capital of France'}]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.4/v/value_error
```
Conclusion¶
These examples show the potential of using Pydantic and OpenAI to improve data accuracy through citation verification. While the LLM-based approach may not be efficient enough for runtime validation, it has exciting implications for generating datasets of accurate responses. By applying this method during data generation, we can fine-tune a model that excels at citation accuracy, similar to our previous post on fine-tuning a better summarizer.
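As a sketch of that dataset-generation idea (the helper below is hypothetical, not part of the library), one can keep only the candidate records that survive validation, where `validate` would be something like `lambda r: AnswerWithCitaton.model_validate(r, context={"text_chunks": ...})`:

```python
def keep_validated(records: list, validate) -> list:
    """Return only the records for which `validate(record)` does not raise."""
    kept = []
    for record in records:
        try:
            validate(record)
        except Exception:
            continue  # drop records whose citations fail validation
        kept.append(record)
    return kept
```

Running candidate answers through such a filter yields a dataset where every retained citation has passed the checks above, which is exactly the property we want in fine-tuning data.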
If you enjoy this content, check out our GitHub, give us a star, and take a look at the library.