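Bulk Classification from User-Provided Tags

This tutorial shows how to classify text using tags that users provide. This is valuable when you want to offer a service that lets users define and run their own classification.

Motivation

Imagine allowing users to upload documents as part of a RAG application. Often, we want to let users specify a set of existing tags, give each one a description, and then have the classification done for them.

Defining the Structure

A simple approach is to let users define a set of tags in some kind of schema and save it in a database. Here is an example of a schema we might use: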

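| tag_id | name | instructions |
|--------|----------|----------------------|
| 0 | personal | Personal information |
| 1 | phone | Phone number |
| 2 | email | Email address |
| 3 | address | Address |
| 4 | Other | Other information |

tag_id - The unique identifier of the tag.
name - The name of the tag.
instructions - A description of the tag, which can be used as a prompt to describe the tag to the model.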
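As a rough illustration, the table above could be persisted in a relational database. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table name `tags` and the helper `load_tags` are assumptions made for this example, not part of the original design.

```python
import sqlite3
from typing import Dict, List

# Hypothetical storage layer: an in-memory SQLite table mirroring the schema above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tags (
        tag_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        instructions TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO tags (tag_id, name, instructions) VALUES (?, ?, ?)",
    [
        (0, "personal", "Personal information"),
        (1, "phone", "Phone number"),
        (2, "email", "Email address"),
        (3, "address", "Address"),
        (4, "Other", "Other information"),
    ],
)
conn.commit()


def load_tags(connection: sqlite3.Connection) -> List[Dict]:
    """Return all stored tags as dictionaries, ordered by tag_id."""
    rows = connection.execute(
        "SELECT tag_id, name, instructions FROM tags ORDER BY tag_id"
    ).fetchall()
    return [
        {"tag_id": tag_id, "name": name, "instructions": instructions}
        for tag_id, name, instructions in rows
    ]


stored_tags = load_tags(conn)
print(stored_tags[1])
```

In a real service each row would presumably also carry an owner (for example a user ID) so that every user sees only their own tags.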

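Implementing the Classification

To do this, we'll do a few things:

1. Use the instructor library to patch the openai library so that we can use the AsyncOpenAI client.
2. Implement a Tag model that validates tags against the context. (This helps us avoid hallucinating tags that are not in the context.)
3. Define helper models for the request and the response.
4. Write an async function that performs the classification.
5. Write a main function that uses asyncio.gather to run the classifications concurrently.

If you want to learn more about async computation, check out our post on AsyncIO.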

import openai
import instructor

client = instructor.from_openai(
    openai.AsyncOpenAI(),
)

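First, we import our Pydantic and instructor code and set up the AsyncOpenAI client. Then we define the tag model, along with a variant that carries instructions, to describe the inputs and outputs.

This is very helpful because once we create endpoints with a tool like FastAPI, the Pydantic models serve several purposes at once:

- Descriptions for developers
- Type hints for the IDE
- OpenAPI documentation for the FastAPI endpoint
- The schema and response model for the language model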

from typing import List
from pydantic import BaseModel, ValidationInfo, model_validator


class Tag(BaseModel):
    id: int
    name: str

    @model_validator(mode="after")
    def validate_ids(self, info: ValidationInfo):
        context = info.context
        if context:
            tags: List[Tag] = context.get("tags")
            assert self.id in {
                tag.id for tag in tags
            }, f"Tag ID {self.id} not found in context"
            assert self.name in {
                tag.name for tag in tags
            }, f"Tag name {self.name} not found in context"
        return self


class TagWithInstructions(Tag):
    instructions: str


class TagRequest(BaseModel):
    texts: List[str]
    tags: List[TagWithInstructions]


class TagResponse(BaseModel):
    texts: List[str]
    predictions: List[Tag]
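To make the "one model, several uses" point concrete, here is a minimal sketch of the JSON Schema that Pydantic derives from a model. `SimpleTag` is a hypothetical, simplified model written only for this illustration; the same kind of schema is what feeds OpenAPI documentation and describes the expected output to a language model.

```python
from pydantic import BaseModel, Field


class SimpleTag(BaseModel):
    """A tag that can be assigned to a piece of text."""

    # Hypothetical field descriptions, added to show how they surface in the schema.
    id: int = Field(description="Unique identifier of the tag")
    name: str = Field(description="Human-readable name of the tag")


# Pydantic v2 derives a JSON Schema from the annotations, docstring, and Field metadata.
schema = SimpleTag.model_json_schema()
print(schema["required"])
#> ['id', 'name']
print(schema["properties"]["name"]["description"])
#> Human-readable name of the tag
```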

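Let's take a closer look at what the validate_ids function does. Its purpose is to pull the allowed tags from the validation context and make sure that each predicted ID and name exists in that set. This approach helps minimize hallucinations: if the model produces an ID or name that is not in the set, an error is raised and instructor can prompt the language model to retry until a valid tag is extracted, subject to the configured retry limit.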

# Excerpt: this validator is a method of the Tag model defined above.
from pydantic import model_validator, ValidationInfo


@model_validator(mode="after")
def validate_ids(self, info: ValidationInfo):
    context = info.context
    if context:
        tags: List[Tag] = context.get("tags")
        assert self.id in {
            tag.id for tag in tags
        }, f"Tag ID {self.id} not found in context"
        assert self.name in {
            tag.name for tag in tags
        }, f"Tag name {self.name} not found in context"
    return self
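The validator can be exercised without calling a language model at all. The sketch below repeats the Tag model so that it runs on its own and passes the allowed tags through Pydantic's `context` argument to `model_validate`; the sample tags and the made-up "fax" tag are assumptions for illustration.

```python
from typing import List

from pydantic import BaseModel, ValidationError, ValidationInfo, model_validator


class Tag(BaseModel):
    id: int
    name: str

    @model_validator(mode="after")
    def validate_ids(self, info: ValidationInfo):
        context = info.context
        if context:
            tags: List[Tag] = context.get("tags")
            assert self.id in {
                tag.id for tag in tags
            }, f"Tag ID {self.id} not found in context"
            assert self.name in {
                tag.name for tag in tags
            }, f"Tag name {self.name} not found in context"
        return self


# Built without a context, so the membership checks are skipped here.
allowed = [Tag(id=1, name="phone"), Tag(id=2, name="email")]

# A tag that exists in the context validates cleanly.
valid = Tag.model_validate({"id": 1, "name": "phone"}, context={"tags": allowed})

# A tag that is not in the context is rejected with a ValidationError.
try:
    Tag.model_validate({"id": 7, "name": "fax"}, context={"tags": allowed})
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc.errors()[0]["msg"])
```

Note that the two checks are independent: an ID and a name that both exist but belong to different tags would still pass.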

Now let's implement the function that performs the classification. It takes a single text and a list of tags and returns the predicted tag.

import asyncio


async def tag_single_request(text: str, tags: List[Tag]) -> Tag:
    allowed_tags = [(tag.id, tag.name) for tag in tags]
    allowed_tags_str = ", ".join([f"`{tag}`" for tag in allowed_tags])

    return await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a world-class text tagging system.",
            },
            {"role": "user", "content": f"Describe the following text: `{text}`"},
            {
                "role": "user",
                "content": f"Here are the allowed tags: {allowed_tags_str}",
            },
        ],
        response_model=Tag,  # Minimizes the hallucination of tags that are not in the allowed tags.
        validation_context={"tags": tags},
    )


async def tag_request(request: TagRequest) -> TagResponse:
    predictions = await asyncio.gather(
        *[tag_single_request(text, request.tags) for text in request.texts]
    )
    return TagResponse(
        texts=request.texts,
        predictions=predictions,
    )

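Notice that we first define a single async function that predicts the tag for one text, and that we pass the allowed tags into the validation context to minimize hallucinations.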
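The prompt in tag_single_request lists only the ID and name of each tag. Since every tag also carries user-written instructions, one possible variation is to fold them into the prompt as well. The sketch below is an assumption about how that might look, not the tutorial's own code; `TagSpec` and `format_allowed_tags` are hypothetical names, and a dataclass stands in for TagWithInstructions to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TagSpec:
    """Stand-in for TagWithInstructions, kept dependency-free for this sketch."""

    id: int
    name: str
    instructions: str


def format_allowed_tags(tags: List[TagSpec]) -> str:
    """Render one line per tag: ID, name, and the user's instructions."""
    return "\n".join(
        f"- id={tag.id}, name={tag.name}: {tag.instructions}" for tag in tags
    )


tags = [
    TagSpec(id=1, name="phone", instructions="Phone number"),
    TagSpec(id=2, name="email", instructions="Email address"),
]
prompt = f"Here are the allowed tags:\n{format_allowed_tags(tags)}"
print(prompt)
#> Here are the allowed tags:
#> - id=1, name=phone: Phone number
#> - id=2, name=email: Email address
```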

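Finally, we put everything together and run the classification over all texts, which tag_request fans out concurrently with asyncio.gather.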

import asyncio

tags = [
    TagWithInstructions(id=0, name="personal", instructions="Personal information"),
    TagWithInstructions(id=1, name="phone", instructions="Phone number"),
    TagWithInstructions(id=2, name="email", instructions="Email address"),
    TagWithInstructions(id=3, name="address", instructions="Address"),
    TagWithInstructions(id=4, name="Other", instructions="Other information"),
]

# Texts will be a range of different questions.
# Such as "How much does it cost?", "What is your privacy policy?", etc.
texts = [
    "What is your phone number?",
    "What is your email address?",
    "What is your address?",
    "What is your privacy policy?",
]

# The request will contain the texts and the tags.
request = TagRequest(texts=texts, tags=tags)

# The response will contain the texts and the predicted tags.
response = asyncio.run(tag_request(request))
print(response.model_dump_json(indent=2))

This produces the following result:

{
  "texts": [
    "What is your phone number?",
    "What is your email address?",
    "What is your address?",
    "What is your privacy policy?"
  ],
  "predictions": [
    {
      "id": 1,
      "name": "phone"
    },
    {
      "id": 2,
      "name": "email"
    },
    {
      "id": 3,
      "name": "address"
    },
    {
      "id": 4,
      "name": "Other"
    }
  ]
}
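Each prediction lines up with the text at the same position. That works because `asyncio.gather` returns results in the order the awaitables were passed in, regardless of which one finishes first. The sketch below demonstrates this with `fake_classify`, a hypothetical stand-in for the model call that needs no network access.

```python
import asyncio
from typing import List


async def fake_classify(text: str) -> str:
    """Hypothetical stand-in for a model call: shorter texts finish sooner."""
    await asyncio.sleep(len(text) / 1000)
    return "phone" if "phone" in text else "Other"


async def classify_all(texts: List[str]) -> List[str]:
    # gather runs the coroutines concurrently and returns results in
    # input order, not in completion order.
    return list(await asyncio.gather(*[fake_classify(t) for t in texts]))


# The second text finishes first, yet its label still comes second.
texts = ["What is your privacy policy, exactly?", "Your phone?"]
labels = asyncio.run(classify_all(texts))
print(labels)
#> ['Other', 'phone']
```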

What happens in production?

If we were to use this in production, we would probably expose it through a FastAPI endpoint.

from fastapi import FastAPI

app = FastAPI()

@app.post("/tag", response_model=TagResponse)
async def tag(request: TagRequest) -> TagResponse:
    return await tag_request(request)

Because everything is already annotated with Pydantic, this code is very simple to write!
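One thing an endpoint like this may need is a cap on how many model calls run at once, since a single request can contain many texts. The following is a minimal sketch of that idea using `asyncio.Semaphore`; the limit of 2, the `fake_tag` stand-in for tag_single_request, and the `stats` bookkeeping are all assumptions made for illustration.

```python
import asyncio
from typing import Dict, List, Tuple

MAX_CONCURRENCY = 2  # hypothetical limit; tune to your provider's rate limits


async def fake_tag(text: str, semaphore: asyncio.Semaphore, stats: Dict[str, int]) -> str:
    """Hypothetical stand-in for tag_single_request, guarded by a semaphore."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulate the network round trip
        stats["active"] -= 1
        return f"tag-for:{text}"


async def tag_with_limit(texts: List[str]) -> Tuple[List[str], int]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    stats = {"active": 0, "peak": 0}  # tracks how many calls overlap
    results = await asyncio.gather(*[fake_tag(t, semaphore, stats) for t in texts])
    return list(results), stats["peak"]


results, peak = asyncio.run(tag_with_limit(["a", "b", "c", "d", "e"]))
print(peak)
#> 2
```

All five texts are still processed and returned in order; the semaphore only limits how many are in flight at the same time.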

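Where do tags come from?

Note that the tag specifications (IDs, names, and instructions) could just as well come from a database or some other source. I'll leave that as an exercise for the reader, but I hope this gives a clear picture of how to do user-defined classification.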

Improving the Model

There are a couple of things we can do to make this system more robust.

Use confidence scores

from pydantic import Field


class TagWithConfidence(Tag):
    confidence: float = Field(
        ...,
        ge=0,
        le=1,
        description="The confidence of the prediction, 0 is low, 1 is high",
    )
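A confidence score is only useful if something acts on it. One option, sketched below under stated assumptions, is to fall back to the catch-all "Other" tag when the score is below a threshold. `ScoredTag`, `resolve`, and the 0.5 threshold are hypothetical; `ScoredTag` is a simplified version of TagWithConfidence without the context validator so that the example runs on its own.

```python
from pydantic import BaseModel, Field, ValidationError


class ScoredTag(BaseModel):
    """Simplified, hypothetical variant of TagWithConfidence."""

    id: int
    name: str
    confidence: float = Field(..., ge=0, le=1)


# Catch-all tag used when the model is not confident enough.
FALLBACK = ScoredTag(id=4, name="Other", confidence=1.0)


def resolve(prediction: ScoredTag, threshold: float = 0.5) -> ScoredTag:
    """Keep confident predictions; otherwise fall back to the catch-all tag."""
    return prediction if prediction.confidence >= threshold else FALLBACK


print(resolve(ScoredTag(id=1, name="phone", confidence=0.9)).name)
#> phone
print(resolve(ScoredTag(id=2, name="email", confidence=0.2)).name)
#> Other

# Out-of-range scores are rejected by the ge/le constraints.
try:
    ScoredTag(id=1, name="phone", confidence=1.5)
    out_of_range_rejected = False
except ValidationError:
    out_of_range_rejected = True
```

Keep in mind that a score reported by a language model is not a calibrated probability, so any threshold would need to be checked against real data.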

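Use multiclass classification

Notice that the example below uses Iterable[Tag] instead of Tag. This is because we might want a multiclass classification model that returns multiple tags for a single text!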

```python
import asyncio
from typing import Iterable, List

import instructor
import openai
from pydantic import BaseModel, ValidationInfo, model_validator

client = instructor.from_openai(
    openai.AsyncOpenAI(),
)


class Tag(BaseModel):
    id: int
    name: str

    @model_validator(mode="after")
    def validate_ids(self, info: ValidationInfo):
        context = info.context
        if context:
            tags: List[Tag] = context.get("tags")
            assert self.id in {
                tag.id for tag in tags
            }, f"Tag ID {self.id} not found in context"
            assert self.name in {
                tag.name for tag in tags
            }, f"Tag name {self.name} not found in context"
        return self


tags = [
    Tag(id=0, name="personal"),
    Tag(id=1, name="phone"),
    Tag(id=2, name="email"),
    Tag(id=3, name="address"),
    Tag(id=4, name="Other"),
]

# Texts will be a range of different questions.
# Such as "How much does it cost?", "What is your privacy policy?", etc.
text = "What is your phone number?"


# Takes a single text and may return several tags for it.
async def get_tags(text: str, tags: List[Tag]) -> List[Tag]:
    allowed_tags = [(tag.id, tag.name) for tag in tags]
    allowed_tags_str = ", ".join([f"`{tag}`" for tag in allowed_tags])

    return await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a world-class text tagging system.",
            },
            {"role": "user", "content": f"Describe the following text: `{text}`"},
            {
                "role": "user",
                "content": f"Here are the allowed tags: {allowed_tags_str}",
            },
        ],
        response_model=Iterable[Tag],
        validation_context={"tags": tags},
    )


tag_results = asyncio.run(get_tags(text, tags))
for tag in tag_results:
    print(tag)
    #> id=1 name='phone'
```