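Field Validation

This guide covers how to add validation to fields when extracting structured data with Instructor. Field validation ensures that extracted data meets specific conditions and constraints.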

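Why Field Validation Matters

Field validation helps you:

  1. Ensure data quality and consistency
  2. Enforce business rules
  3. Prevent errors in downstream processing
  4. Provide clear feedback on invalid data

Instructor uses Pydantic's validation system, which is applied automatically during extraction.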

Basic Field Constraints

You can add basic constraints to a field with Pydantic's Field function:

from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class User(BaseModel):
    name: str = Field(..., min_length=2, max_length=50)
    age: int = Field(..., ge=0, le=120)  # greater than or equal to 0, less than or equal to 120
    email: str = Field(..., pattern=r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

# Extract with validation
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "I'm John Smith, 35 years old, with email john@example.com"}
    ],
    response_model=User
)

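Common field constraints include:

| Constraint | Description | Example |
|---|---|---|
| min_length | Minimum string length, or minimum number of list items | min_length=2 |
| max_length | Maximum string length, or maximum number of list items | max_length=50 |
| pattern | Regular expression the string must match | pattern=r'^[0-9]+$' |
| gt | Greater than | gt=0 (for numbers) |
| ge | Greater than or equal to | ge=18 |
| lt | Less than | lt=100 |
| le | Less than or equal to | le=120 |

In Pydantic v2, the older names min_items and max_items are deprecated; use min_length and max_length for lists as well.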

For more on field definitions, see the Fields concept page.
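Because these constraints are plain Pydantic, they can be exercised locally without calling a model. The following is a minimal sketch; the helper name constraint_errors is hypothetical and exists only for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    name: str = Field(..., min_length=2, max_length=50)
    age: int = Field(..., ge=0, le=120)


def constraint_errors(data: dict) -> list[str]:
    """Return the names of the fields that fail validation (empty if valid)."""
    try:
        User.model_validate(data)
        return []
    except ValidationError as exc:
        # Each error carries a "loc" tuple; its first element is the field name here.
        return [str(err["loc"][0]) for err in exc.errors()]


valid = constraint_errors({"name": "John Smith", "age": 35})
invalid = constraint_errors({"name": "J", "age": 150})
```

Checking a model this way before wiring it into an extraction call can make it easier to confirm that the constraints reject what you expect.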

Validation with Field Validators

For more complex validation logic, use Pydantic's field_validator decorator:

from pydantic import BaseModel, Field, field_validator
import instructor
from openai import OpenAI
import re

client = instructor.from_openai(OpenAI())

class Product(BaseModel):
    name: str
    sku: str
    price: float

    @field_validator('name')
    @classmethod
    def validate_name(cls, v):
        if len(v.strip()) < 3:
            raise ValueError("Product name must be at least 3 characters")
        return v.strip()

    @field_validator('sku')
    @classmethod
    def validate_sku(cls, v):
        if not re.match(r'^[A-Z]{3}-\d{4}$', v):
            raise ValueError("SKU must be in format XXX-0000")
        return v

    @field_validator('price')
    @classmethod
    def validate_price(cls, v):
        if v <= 0:
            raise ValueError("Price must be greater than zero")
        if v > 10000:
            raise ValueError("Price exceeds maximum allowed value")
        return v

# Extract validated data
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Product: Wireless Headphones, SKU: ABC-1234, Price: $79.99"}
    ],
    response_model=Product
)

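Field validators can:

  - Run complex validation logic
  - Clean and normalize data
  - Transform values
  - Check values against external data sources

For more on custom validators, see the Custom Validators guide.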
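To illustrate the cleaning and transforming role of a validator, here is a small sketch in which the value a validator returns replaces the input. Upper-casing the SKU before checking it is an assumption made for this example and is not part of the Product model shown earlier.

```python
import re

from pydantic import BaseModel, ValidationError, field_validator


class Product(BaseModel):
    name: str
    sku: str

    @field_validator("name")
    @classmethod
    def clean_name(cls, v: str) -> str:
        # Normalize first, then validate the cleaned value.
        v = v.strip()
        if len(v) < 3:
            raise ValueError("Product name must be at least 3 characters")
        return v

    @field_validator("sku")
    @classmethod
    def normalize_sku(cls, v: str) -> str:
        # Assumption for this sketch: lowercase SKUs are accepted and upper-cased.
        v = v.strip().upper()
        if not re.fullmatch(r"[A-Z]{3}-\d{4}", v):
            raise ValueError("SKU must be in format XXX-0000")
        return v


product = Product(name="  Wireless Headphones ", sku="abc-1234")

try:
    Product(name="Wireless Headphones", sku="1234-ABC")
    sku_error = None
except ValidationError as exc:
    sku_error = exc.errors()[0]["msg"]
```

The stored values are the returned ones, so product.name has no surrounding whitespace and product.sku is upper case.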

Model-Level Validation

Sometimes validation needs to check relationships between fields. Use model_validator for this:

from pydantic import BaseModel, Field, model_validator
import instructor
from openai import OpenAI
from datetime import date

client = instructor.from_openai(OpenAI())

class DateRange(BaseModel):
    start_date: date
    end_date: date

    @model_validator(mode='after')
    def validate_date_range(self):
        if self.end_date < self.start_date:
            raise ValueError("End date must not be before start date")
        return self
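A short usage sketch for a date-range model of this kind, run locally with plain Pydantic. The dates are made-up inputs; ISO-formatted strings are used because Pydantic parses them into date objects.

```python
from datetime import date

from pydantic import BaseModel, ValidationError, model_validator


class DateRange(BaseModel):
    start_date: date
    end_date: date

    @model_validator(mode="after")
    def validate_date_range(self):
        # Runs after both fields have been parsed into date objects.
        if self.end_date < self.start_date:
            raise ValueError("End date must not be before start date")
        return self


ok = DateRange(start_date="2024-01-01", end_date="2024-01-31")

try:
    DateRange(start_date="2024-02-01", end_date="2024-01-01")
    range_error = None
except ValidationError as exc:
    range_error = exc.errors()[0]["msg"]
```

Because the check lives on the model rather than on a single field, the resulting error describes the combination of values and is not attributed to either date on its own.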

Validation in Nested Structures

You can apply validation at any level of a nested structure:

from pydantic import BaseModel, Field, field_validator
import instructor
from openai import OpenAI
from typing import List

client = instructor.from_openai(OpenAI())

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

    @field_validator('state')
    @classmethod
    def validate_state(cls, v):
        valid_states = {"CA", "NY", "TX", "FL"}  # Example: just a few states
        if v not in valid_states:
            raise ValueError(f"State must be one of: {', '.join(sorted(valid_states))}")
        return v

    @field_validator('zip_code')
    @classmethod
    def validate_zip(cls, v):
        if not v.isdigit() or len(v) != 5:
            raise ValueError("ZIP code must be 5 digits")
        return v

class Person(BaseModel):
    name: str
    addresses: List[Address]  # Nested structure with validation

For more on nested structures, see the Nested Structures guide.
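When a nested validator fails, the error's loc tuple records the path to the offending value, which helps pinpoint which list item is wrong. A minimal sketch with a trimmed-down Address model and made-up data:

```python
from typing import List

from pydantic import BaseModel, ValidationError, field_validator


class Address(BaseModel):
    city: str
    zip_code: str

    @field_validator("zip_code")
    @classmethod
    def validate_zip(cls, v: str) -> str:
        if not v.isdigit() or len(v) != 5:
            raise ValueError("ZIP code must be 5 digits")
        return v


class Person(BaseModel):
    name: str
    addresses: List[Address]


data = {
    "name": "Ada",
    "addresses": [
        {"city": "Austin", "zip_code": "73301"},
        {"city": "Miami", "zip_code": "3310"},  # invalid: only 4 digits
    ],
}

try:
    Person.model_validate(data)
    error_path = None
except ValidationError as exc:
    # "loc" is a tuple of field names and list indexes leading to the failing value.
    error_path = exc.errors()[0]["loc"]
```

Here error_path points at the zip_code of the second address (index 1), not at the Person model as a whole.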

List Item Validation

You can validate the items in a list:

from typing import List
from pydantic import BaseModel, Field, field_validator
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class TagList(BaseModel):
    tags: List[str] = Field(..., min_length=1, max_length=5)  # Pydantic v2: min_length/max_length replace min_items/max_items

    @field_validator('tags')
    @classmethod
    def validate_tags(cls, tags):
        # Convert all tags to lowercase
        tags = [tag.lower() for tag in tags]

        # Check for minimum length of each tag
        for tag in tags:
            if len(tag) < 2:
                raise ValueError("Each tag must be at least 2 characters")

        # Check for duplicates
        if len(tags) != len(set(tags)):
            raise ValueError("Tags must be unique")

        return tags

For more on lists, see the List Extraction guide.
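As an alternative to looping inside a validator, per-item constraints can be declared on the item type itself with Annotated. This is a sketch of that option, not a replacement for the validator approach; the alias name Tag is hypothetical.

```python
from typing import Annotated, List

from pydantic import BaseModel, Field, ValidationError

# Each item must be a string of at least 2 characters.
Tag = Annotated[str, Field(min_length=2)]


class TagList(BaseModel):
    # The outer Field constrains the list length, the Tag alias constrains each item.
    tags: List[Tag] = Field(..., min_length=1, max_length=5)


good = TagList(tags=["python", "ai"])

try:
    TagList(tags=["python", "x"])
    bad_loc = None
except ValidationError as exc:
    bad_loc = exc.errors()[0]["loc"]
```

Declarative item constraints report the index of the failing item in loc, while checks such as uniqueness or lower-casing still need a field_validator on the whole list.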

Validation with Enums

Enums provide a way to validate a field against a predefined set of values:

from enum import Enum
from pydantic import BaseModel
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class Task(BaseModel):
    title: str
    description: str
    status: Status  # Must be one of the enum values
    priority: Priority  # Must be one of the enum values

# Extract with enum validation
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Task: Update website, Description: Refresh content on homepage, Status: pending, Priority: high"}
    ],
    response_model=Task
)

For more on enums, see the Enums concept page.
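The effect of an enum field can be checked locally: a string that matches a member is converted to that member, any other value is rejected, and the allowed values appear in the model's JSON schema. A minimal sketch with a trimmed-down Task model:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


class Task(BaseModel):
    title: str
    status: Status


# A matching string is coerced to the enum member.
task = Task(title="Update website", status="pending")

try:
    Task(title="Update website", status="done")  # not a member of Status
    enum_error = None
except ValidationError as exc:
    enum_error = exc.errors()[0]

# The allowed values are part of the generated JSON schema.
allowed = Task.model_json_schema()["$defs"]["Status"]["enum"]
```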

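Custom Error Messages

You can customize validation error messages to provide better feedback. The example below attaches an error_msg hint to each field through json_schema_extra. Note that Pydantic only copies this key into the generated JSON schema; it does not use it as the text of a validation error.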

from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class CreditCard(BaseModel):
    number: str = Field(
        ..., 
        pattern=r'^\d{16}$',
        json_schema_extra={"error_msg": "Credit card number must be exactly 16 digits"}
    )
    expiry_month: int = Field(
        ..., 
        ge=1, 
        le=12,
        json_schema_extra={"error_msg": "Expiry month must be between 1 and 12"}
    )
    expiry_year: int = Field(
        ..., 
        ge=2023, 
        le=2030,
        json_schema_extra={"error_msg": "Expiry year must be between 2023 and 2030"}
    )
    cvv: str = Field(
        ..., 
        pattern=r'^\d{3,4}$',
        json_schema_extra={"error_msg": "CVV must be 3 or 4 digits"}
    )
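If the goal is to control the text of the validation error itself, one option is to raise ValueError with the desired message inside a field_validator. The sketch below uses a reduced, hypothetical version of the CreditCard model and made-up invalid input.

```python
from pydantic import BaseModel, ValidationError, field_validator


class CreditCard(BaseModel):
    number: str
    expiry_month: int

    @field_validator("number")
    @classmethod
    def check_number(cls, v: str) -> str:
        if not (v.isascii() and v.isdigit() and len(v) == 16):
            raise ValueError("Credit card number must be exactly 16 digits")
        return v

    @field_validator("expiry_month")
    @classmethod
    def check_month(cls, v: int) -> int:
        if not 1 <= v <= 12:
            raise ValueError("Expiry month must be between 1 and 12")
        return v


try:
    CreditCard(number="1234", expiry_month=13)
    messages = []
except ValidationError as exc:
    # Pydantic wraps the ValueError text in the "msg" entry of each error.
    messages = [err["msg"] for err in exc.errors()]
```

The custom text then appears in the error report, which is the feedback that gets passed along when a failed extraction is retried.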

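Handling Validation Failures

When validation fails, Instructor will:

  1. Catch the validation error
  2. Add the error message to the context
  3. Retry the request with this feedback (if retries are enabled)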
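The three steps above can be sketched as a plain loop. This is a conceptual illustration only, not Instructor's actual implementation: extract_with_retries, fake_model, and the wording of the feedback message are all hypothetical, and fake_model stands in for a real LLM call.

```python
import json
from typing import Callable

from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    name: str
    age: int = Field(..., ge=0, le=120)


def extract_with_retries(call_model: Callable[[list], str], messages: list, max_attempts: int = 3) -> User:
    """Validate the model's JSON output, feeding errors back until it passes."""
    messages = list(messages)  # do not mutate the caller's list
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(messages)
        try:
            return User.model_validate_json(raw)
        except ValidationError as exc:
            # Steps 1 and 2: catch the error and add it to the conversation.
            last_error = exc
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"Validation failed, please correct:\n{exc}"})
    # Step 3 ran out of attempts: surface the last validation error.
    raise last_error


calls = []


def fake_model(messages: list) -> str:
    """Hypothetical stand-in for an LLM: wrong on the first call, right afterwards."""
    calls.append(len(messages))
    age = 350 if len(messages) == 1 else 35
    return json.dumps({"name": "John", "age": age})


user = extract_with_retries(fake_model, [{"role": "user", "content": "I'm John, 35 years old"}])
```

In this run the first answer fails the age constraint, the error is appended to the conversation, and the second answer validates.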

Controlling retry behavior:

client = instructor.from_openai(
    OpenAI(),
    max_retries=2,  # Number of retries after the initial attempt
    throw_error=True  # Whether to raise an exception on validation failure
)

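For more on retries, see the Retry Mechanism guide.

Practical Example: Form Data Validation

Here is a more complete example of validating form input: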

from pydantic import BaseModel, Field, field_validator, model_validator
import instructor
from openai import OpenAI
import re
from datetime import date

client = instructor.from_openai(OpenAI())

class RegistrationForm(BaseModel):
    username: str = Field(..., min_length=3, max_length=20)
    email: str
    password: str
    confirm_password: str
    birth_date: date

    @field_validator('username')
    @classmethod
    def validate_username(cls, v):
        if not re.match(r'^[a-zA-Z0-9_]+$', v):
            raise ValueError("Username can only contain letters, numbers, and underscores")
        return v

    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if not re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', v):
            raise ValueError("Invalid email format")
        return v

    @field_validator('password')
    @classmethod
    def validate_password(cls, v):
        if len(v) < 8:
            raise ValueError("Password must be at least 8 characters")
        if not re.search(r'[A-Z]', v):
            raise ValueError("Password must contain at least one uppercase letter")
        if not re.search(r'[a-z]', v):
            raise ValueError("Password must contain at least one lowercase letter")
        if not re.search(r'[0-9]', v):
            raise ValueError("Password must contain at least one number")
        return v

    @field_validator('birth_date')
    @classmethod
    def validate_age(cls, v):
        today = date.today()
        age = today.year - v.year - ((today.month, today.day) < (v.month, v.day))
        if age < 18:
            raise ValueError("You must be at least 18 years old to register")
        return v

    @model_validator(mode='after')
    def passwords_match(self):
        if self.password != self.confirm_password:
            raise ValueError("Passwords do not match")
        return self
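When a form has many rules, it can help to group the reported errors by field before showing them to a user. The sketch below uses a trimmed-down, hypothetical SignupForm so that it stays self-contained; the helper errors_by_field and the "__model__" key for model-level errors are assumptions made for this example.

```python
from collections import defaultdict

from pydantic import BaseModel, Field, ValidationError, model_validator


class SignupForm(BaseModel):  # hypothetical, reduced version of a registration form
    username: str = Field(..., min_length=3, max_length=20)
    password: str = Field(..., min_length=8)
    confirm_password: str

    @model_validator(mode="after")
    def passwords_match(self):
        if self.password != self.confirm_password:
            raise ValueError("Passwords do not match")
        return self


def errors_by_field(exc: ValidationError) -> dict:
    """Group error messages by dotted field path."""
    grouped = defaultdict(list)
    for err in exc.errors():
        # Errors with an empty location are filed under "__model__".
        field = ".".join(str(part) for part in err["loc"]) or "__model__"
        grouped[field].append(err["msg"])
    return dict(grouped)


def validate_form(data: dict) -> dict:
    """Return an empty dict when the form is valid, otherwise the grouped errors."""
    try:
        SignupForm.model_validate(data)
        return {}
    except ValidationError as exc:
        return errors_by_field(exc)


field_report = validate_form({"username": "al", "password": "short", "confirm_password": "short"})
mismatch_report = validate_form({"username": "alice", "password": "Secret123", "confirm_password": "Secret124"})
```

Note that a mode='after' model validator only runs once every field has passed its own validation, so field_report contains the field errors alone, while mismatch_report contains the password mismatch.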

Next Steps