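Field Validation
This guide covers how to add validation to fields when extracting structured data with Instructor. Field validation ensures that extracted data meets specific conditions and constraints.
Why Field Validation Matters
Field validation helps you:
- Ensure data quality and consistency
- Enforce business rules
- Prevent errors in downstream processing
- Give clear feedback on invalid data
Instructor uses Pydantic's validation system, which is applied automatically during extraction.
Basic Field Constraints
Use Pydantic's Field function to add basic constraints to a field: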
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class User(BaseModel):
    name: str = Field(..., min_length=2, max_length=50)
    age: int = Field(..., ge=0, le=120)  # 0 <= age <= 120
    email: str = Field(..., pattern=r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

# Extract with validation
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "I'm John Smith, 35 years old, with email john@example.com"}
    ],
    response_model=User,
)
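Because Instructor delegates validation to Pydantic, the constraints can be exercised locally without calling a model. The following is a minimal sketch, assuming Pydantic v2; the `failed_fields` helper is hypothetical and exists only for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    name: str = Field(..., min_length=2, max_length=50)
    age: int = Field(..., ge=0, le=120)


def failed_fields(payload: dict) -> list:
    """Return the names of the fields that fail validation (hypothetical helper)."""
    try:
        User.model_validate(payload)
    except ValidationError as exc:
        # Each error carries a "loc" tuple whose first entry is the field name.
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


valid = failed_fields({"name": "John Smith", "age": 35})
invalid = failed_fields({"name": "J", "age": 150})
print(valid, invalid)
```

Checking a model this way is a cheap way to confirm that the constraints reject what you expect before spending API calls on extraction.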
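Common field constraints include:
Constraint | Description | Example |
---|---|---|
min_length | Minimum string length | min_length=2 |
max_length | Maximum string length | max_length=50 |
pattern | Regular expression the string must match | pattern=r'^[0-9]+$' |
gt | Greater than | gt=0 (for numbers) |
ge | Greater than or equal to | ge=18 |
lt | Less than | lt=100 |
le | Less than or equal to | le=120 |
min_length (lists) | Minimum number of list items; named min_items in Pydantic v1 | min_length=1 |
max_length (lists) | Maximum number of list items; named max_items in Pydantic v1 | max_length=10 |
For more on field definitions, see the Fields concepts page.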
Validation with Field Validators
For more complex validation logic, use Pydantic's field_validator decorator:
from pydantic import BaseModel, field_validator
import instructor
from openai import OpenAI
import re

client = instructor.from_openai(OpenAI())

class Product(BaseModel):
    name: str
    sku: str
    price: float

    @field_validator('name')
    @classmethod
    def validate_name(cls, v):
        if len(v.strip()) < 3:
            raise ValueError("Product name must be at least 3 characters")
        return v.strip()

    @field_validator('sku')
    @classmethod
    def validate_sku(cls, v):
        if not re.match(r'^[A-Z]{3}-\d{4}$', v):
            raise ValueError("SKU must be in format XXX-0000")
        return v

    @field_validator('price')
    @classmethod
    def validate_price(cls, v):
        if v <= 0:
            raise ValueError("Price must be greater than zero")
        if v > 10000:
            raise ValueError("Price exceeds maximum allowed value")
        return v

# Extract validated data
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Product: Wireless Headphones, SKU: ABC-1234, Price: $79.99"}
    ],
    response_model=Product,
)
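Field validators can:
- Run complex validation logic
- Clean and normalize data
- Transform values
- Check values against external data sources
For more on custom validators, see the Custom Validators guide.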
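A validator both checks and transforms: whatever it returns becomes the field's value. The sketch below, assuming Pydantic v2, uses a trimmed-down `Product` model to show a name being normalized and a malformed SKU being rejected, with no API call involved.

```python
import re

from pydantic import BaseModel, ValidationError, field_validator


class Product(BaseModel):
    name: str
    sku: str

    @field_validator("name")
    @classmethod
    def validate_name(cls, v: str) -> str:
        v = v.strip()
        if len(v) < 3:
            raise ValueError("Product name must be at least 3 characters")
        return v  # the stripped value is what ends up on the model

    @field_validator("sku")
    @classmethod
    def validate_sku(cls, v: str) -> str:
        if not re.match(r"^[A-Z]{3}-\d{4}$", v):
            raise ValueError("SKU must be in format XXX-0000")
        return v


product = Product(name="  Wireless Headphones  ", sku="ABC-1234")
print(repr(product.name))

try:
    Product(name="Wireless Headphones", sku="abc-1234")  # lowercase prefix
    sku_rejected = False
except ValidationError:
    sku_rejected = True
```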
Model-Level Validation
Sometimes validation needs to check relationships between fields. Use model_validator for this:
from pydantic import BaseModel, model_validator
import instructor
from openai import OpenAI
from datetime import date

client = instructor.from_openai(OpenAI())

class DateRange(BaseModel):
    start_date: date
    end_date: date

    @model_validator(mode='after')
    def validate_date_range(self):
        # Runs after both fields have been parsed, so they can be compared
        if self.end_date < self.start_date:
            raise ValueError("End date must not be before start date")
        return self
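To see the cross-field check in action without a model call, the following sketch (assuming Pydantic v2) builds one valid and one reversed range and collects the resulting error messages.

```python
from datetime import date

from pydantic import BaseModel, ValidationError, model_validator


class DateRange(BaseModel):
    start_date: date
    end_date: date

    @model_validator(mode="after")
    def validate_date_range(self):
        if self.end_date < self.start_date:
            raise ValueError("End date must not be before start date")
        return self


ok = DateRange(start_date=date(2024, 1, 1), end_date=date(2024, 1, 31))

try:
    DateRange(start_date=date(2024, 2, 1), end_date=date(2024, 1, 1))
    messages = []
except ValidationError as exc:
    # The ValueError raised in the validator is wrapped in a ValidationError.
    messages = [err["msg"] for err in exc.errors()]

print(messages)
```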
Validation in Nested Structures
You can apply validation at any level of a nested structure:
from pydantic import BaseModel, field_validator
import instructor
from openai import OpenAI
from typing import List

client = instructor.from_openai(OpenAI())

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

    @field_validator('state')
    @classmethod
    def validate_state(cls, v):
        valid_states = {"CA", "NY", "TX", "FL"}  # Example: just a few states
        if v not in valid_states:
            # sorted() keeps the message stable; set order is arbitrary
            raise ValueError(f"State must be one of: {', '.join(sorted(valid_states))}")
        return v

    @field_validator('zip_code')
    @classmethod
    def validate_zip(cls, v):
        if not v.isdigit() or len(v) != 5:
            raise ValueError("ZIP code must be 5 digits")
        return v

class Person(BaseModel):
    name: str
    addresses: List[Address]  # Nested structure with validation
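When a nested field fails, Pydantic reports the full path to it, which makes the failing item easy to locate. A minimal sketch, assuming Pydantic v2 and a shortened `Address` model:

```python
from typing import List

from pydantic import BaseModel, ValidationError, field_validator


class Address(BaseModel):
    city: str
    zip_code: str

    @field_validator("zip_code")
    @classmethod
    def validate_zip(cls, v: str) -> str:
        if not v.isdigit() or len(v) != 5:
            raise ValueError("ZIP code must be 5 digits")
        return v


class Person(BaseModel):
    name: str
    addresses: List[Address]


payload = {
    "name": "John Smith",
    "addresses": [
        {"city": "Austin", "zip_code": "73301"},
        {"city": "Miami", "zip_code": "3310"},  # too short
    ],
}

try:
    Person.model_validate(payload)
    error_paths = []
except ValidationError as exc:
    # "loc" is the path from the outer model down to the failing field.
    error_paths = [err["loc"] for err in exc.errors()]

print(error_paths)
```

Here the only error path is `("addresses", 1, "zip_code")`: the second address in the list, field `zip_code`.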
For more on nested structures, see the Nested Structures guide.
Validating List Items
You can validate the items in a list:
from typing import List
from pydantic import BaseModel, Field, field_validator
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class TagList(BaseModel):
    # Pydantic v2 uses min_length/max_length for list size (min_items/max_items in v1)
    tags: List[str] = Field(..., min_length=1, max_length=5)

    @field_validator('tags')
    @classmethod
    def validate_tags(cls, tags):
        # Convert all tags to lowercase
        tags = [tag.lower() for tag in tags]
        # Check for minimum length of each tag
        for tag in tags:
            if len(tag) < 2:
                raise ValueError("Each tag must be at least 2 characters")
        # Check for duplicates
        if len(tags) != len(set(tags)):
            raise ValueError("Tags must be unique")
        return tags
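Per-item rules can also be declared on the item type itself instead of looping in a validator. The sketch below assumes Pydantic v2, where list size limits are spelled `min_length` and `max_length`; the `Tag` alias is a hypothetical name introduced for illustration.

```python
from typing import Annotated, List

from pydantic import BaseModel, Field, ValidationError

# Hypothetical alias: a tag is a string of at least 2 characters.
Tag = Annotated[str, Field(min_length=2)]


class TagList(BaseModel):
    tags: List[Tag] = Field(..., min_length=1, max_length=5)


good = TagList(tags=["python", "ai"])

try:
    TagList(tags=["python", "x"])  # second item is too short
    bad_locations = []
except ValidationError as exc:
    bad_locations = [err["loc"] for err in exc.errors()]

try:
    TagList(tags=[])  # violates the list-level min_length
    empty_rejected = False
except ValidationError:
    empty_rejected = True

print(bad_locations, empty_rejected)
```

Declaring the constraint on the item type reports the index of the offending item, while a validator over the whole list is still the better fit for checks that span items, such as uniqueness.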
For more on lists, see the List Extraction guide.
Validation with Enums
Enums restrict a field to a predefined set of values:
from enum import Enum
from pydantic import BaseModel
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class Task(BaseModel):
    title: str
    description: str
    status: Status  # Must be one of the enum values
    priority: Priority  # Must be one of the enum values

# Extract with enum validation
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Task: Update website, Description: Refresh content on homepage, Status: pending, Priority: high"}
    ],
    response_model=Task,
)
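An enum field does two things: it rejects values outside the set, and it publishes the allowed values in the model's JSON schema, which is presumably what the model sees when Instructor sends the schema along with the request. A minimal local sketch, assuming Pydantic v2:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


class Task(BaseModel):
    title: str
    status: Status


# A matching string is converted to the enum member.
task = Task.model_validate({"title": "Update website", "status": "pending"})

try:
    Task.model_validate({"title": "Update website", "status": "urgent"})
    rejected = False
except ValidationError:
    rejected = True

# The allowed values are part of the JSON schema generated for the model.
allowed = Task.model_json_schema()["$defs"]["Status"]["enum"]
print(task.status, rejected, allowed)
```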
For more on enums, see the Enums concepts page.
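Custom Error Messages
You can attach custom messages to fields to give better feedback. The example stores them as metadata through json_schema_extra. Note that this adds the text to the generated JSON schema; it does not by itself change the message Pydantic raises when a constraint fails: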
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class CreditCard(BaseModel):
    number: str = Field(
        ...,
        pattern=r'^\d{16}$',
        json_schema_extra={"error_msg": "Credit card number must be exactly 16 digits"}
    )
    expiry_month: int = Field(
        ...,
        ge=1,
        le=12,
        json_schema_extra={"error_msg": "Expiry month must be between 1 and 12"}
    )
    expiry_year: int = Field(
        ...,
        ge=2023,
        le=2030,
        json_schema_extra={"error_msg": "Expiry year must be between 2023 and 2030"}
    )
    cvv: str = Field(
        ...,
        pattern=r'^\d{3,4}$',
        json_schema_extra={"error_msg": "CVV must be 3 or 4 digits"}
    )
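When the exact wording has to appear in the raised error, one option is to raise it from a validator, since the text passed to ValueError is carried into the validation error. This is a sketch of that alternative, assuming Pydantic v2, not a replacement for the pattern above:

```python
from pydantic import BaseModel, ValidationError, field_validator


class CreditCard(BaseModel):
    number: str

    @field_validator("number")
    @classmethod
    def validate_number(cls, v: str) -> str:
        # isascii() guards against non-ASCII digit characters.
        if not (v.isascii() and v.isdigit() and len(v) == 16):
            raise ValueError("Credit card number must be exactly 16 digits")
        return v


try:
    CreditCard(number="1234")
    messages = []
except ValidationError as exc:
    messages = [err["msg"] for err in exc.errors()]

print(messages)
```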
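Handling Validation Failures
When validation fails, Instructor:
- Captures the validation error
- Adds the error message to the context
- Retries the request with this feedback (if retries are enabled)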
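The capture, feed back, retry cycle can be pictured with a small loop. This is an illustrative sketch only, not Instructor's actual implementation: `extract_with_retries` and `fake_model` are hypothetical names, and `fake_model` stands in for a real LLM call so the example runs offline.

```python
from typing import Callable, List

from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    name: str
    age: int = Field(..., ge=0, le=120)


def extract_with_retries(
    ask_model: Callable[[List[str]], dict], prompt: str, max_attempts: int = 3
) -> User:
    """Illustrative loop only; not Instructor's actual implementation."""
    context = [prompt]
    for _ in range(max_attempts):
        raw = ask_model(context)
        try:
            return User.model_validate(raw)
        except ValidationError as exc:
            # Feed the error text back so the next attempt can correct itself.
            context.append(f"Validation failed, please fix: {exc}")
    raise RuntimeError("Validation still failing after all attempts")


def fake_model(context: List[str]) -> dict:
    """Hypothetical stand-in for an LLM: wrong at first, right after feedback."""
    if len(context) == 1:
        return {"name": "John Smith", "age": 350}
    return {"name": "John Smith", "age": 35}


user = extract_with_retries(fake_model, "I'm John Smith, 35 years old")
print(user)
```

The point of the sketch is that clear error messages matter twice: they are what the model reads on the next attempt.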
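To control retry behavior, pass max_retries on the extraction call. Once the retry budget is exhausted, Instructor raises an exception:
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "I'm John Smith, 35 years old, with email john@example.com"}
    ],
    response_model=User,
    max_retries=2,  # Retry budget when the response fails validation
)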
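For more on retries, see the Retry Mechanisms guide.
Practical Example: Form Data Validation
Here is a more complete example of validating form input: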
from pydantic import BaseModel, Field, field_validator, model_validator
import instructor
from openai import OpenAI
import re
from datetime import date

client = instructor.from_openai(OpenAI())

class RegistrationForm(BaseModel):
    username: str = Field(..., min_length=3, max_length=20)
    email: str
    password: str
    confirm_password: str
    birth_date: date

    @field_validator('username')
    @classmethod
    def validate_username(cls, v):
        if not re.match(r'^[a-zA-Z0-9_]+$', v):
            raise ValueError("Username can only contain letters, numbers, and underscores")
        return v

    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if not re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', v):
            raise ValueError("Invalid email format")
        return v

    @field_validator('password')
    @classmethod
    def validate_password(cls, v):
        if len(v) < 8:
            raise ValueError("Password must be at least 8 characters")
        if not re.search(r'[A-Z]', v):
            raise ValueError("Password must contain at least one uppercase letter")
        if not re.search(r'[a-z]', v):
            raise ValueError("Password must contain at least one lowercase letter")
        if not re.search(r'[0-9]', v):
            raise ValueError("Password must contain at least one number")
        return v

    @field_validator('birth_date')
    @classmethod
    def validate_age(cls, v):
        today = date.today()
        # Subtract one year if this year's birthday has not happened yet
        age = today.year - v.year - ((today.month, today.day) < (v.month, v.day))
        if age < 18:
            raise ValueError("You must be at least 18 years old to register")
        return v

    @model_validator(mode='after')
    def passwords_match(self):
        if self.password != self.confirm_password:
            raise ValueError("Passwords do not match")
        return self
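The age check in `validate_age` relies on a compact tuple comparison that is easy to misread. The standalone sketch below isolates it; the `age_on` helper is a hypothetical name, and it takes `today` as a parameter so the result does not depend on the current date.

```python
from datetime import date


def age_on(birth_date: date, today: date) -> int:
    """Whole years between birth_date and today."""
    # The tuple comparison is True (counts as 1) when this year's birthday
    # has not happened yet, so one year is subtracted.
    had_no_birthday_yet = (today.month, today.day) < (birth_date.month, birth_date.day)
    return today.year - birth_date.year - had_no_birthday_yet


day_before = age_on(date(2000, 6, 15), today=date(2018, 6, 14))
on_birthday = age_on(date(2000, 6, 15), today=date(2018, 6, 15))
print(day_before, on_birthday)
```

Someone born on 15 June 2000 is still 17 on 14 June 2018 and turns 18 the next day, so only the second date would pass the registration check.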