Extracting Metadata from Images Using Structured Extraction

Multimodal language models like gpt-4o excel at processing multimodal data, which lets us extract rich, structured metadata from images.

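This is particularly valuable in areas like fashion, where we can use these capabilities to understand a user's style preferences from images and even videos. In this post, we'll see how to use instructor to map images to a given product taxonomy so that we can recommend similar products to users.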

Why Image Metadata Is Useful

Most online e-commerce stores have their own product taxonomy: a way of categorizing products so that users can easily find what they're looking for.

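A small example of a taxonomy is shown below. You can think of it as a way of mapping a product to a set of attributes, some of which are common to all products.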

tops:
  t-shirts:
    - crew_neck
    - v_neck
    - graphic_tees
  sweaters:
    - crewneck
    - cardigan
    - pullover
  jackets:
    - bomber_jackets
    - denim_jackets
    - leather_jackets

bottoms:
  pants:
    - chinos
    - dress_pants
    - cargo_pants
  shorts:
    - athletic_shorts
    - cargo_shorts

colors:
  - black
  - navy
  - white
  - beige
  - brown

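By using this taxonomy, we can ensure that our model extracts metadata that is consistent with the products we sell. In this example, we'll analyze style photos from a fitness influencer to understand their fashion preferences and see which products from our own catalog we could recommend to them.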

We're using some photos from a fitness influencer called Jpgeez, which you can see below.

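While we map these visual elements to a taxonomy here, the same approach applies to any other use case where you need to extract metadata from images.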

Extracting Metadata from Images

Instructor's `Image` class

With instructor, working with multimodal data is easy. We can use the Image class to load images from a URL or a local file, as in the example below.

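import os

import instructor

# Assumption: every file in ./images is an image; adjust the listing as needed.
image_files = sorted(os.listdir("./images"))

# Load images using instructor.Image.from_path
images = []
for image_file in image_files:
    image_path = os.path.join("./images", image_file)
    image = instructor.Image.from_path(image_path)
    images.append(image)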

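We provide a variety of methods for loading images, including from a URL, from a local file, and even from a base64-encoded string; the instructor documentation covers them in more detail.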
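To make the base64 option more concrete, here is a minimal sketch that uses only the standard library to turn a local image file into a base64 data URI. The helper name `to_data_uri` is hypothetical and not part of instructor, and whether a given loader expects a full data URI or the raw base64 payload is something to check against the instructor documentation.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI (hypothetical helper)."""
    # Guess the media type from the file extension, e.g. "image/png".
    media_type, _ = mimetypes.guess_type(path)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"{path} does not look like an image file")
    # Read the raw bytes and base64-encode them as ASCII text.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{media_type};base64,{encoded}"
```

The part after the comma is the raw base64 string; the prefix only records the media type so that a consumer knows how to decode the bytes.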

Defining a response model

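Since our taxonomy is defined in a yaml file, we can't use literals to define the response model. Instead, we can read the configuration from the yaml file and use it in a model_validator step to make sure that the metadata we extract is consistent with the taxonomy.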

First, we read the taxonomy from the yaml file and build sets of categories, subcategories, and product types.

import yaml

with open("taxonomy.yml", "r") as file:
    taxonomy = yaml.safe_load(file)

colors = taxonomy["colors"]
categories = set(taxonomy.keys())
categories.remove("colors")

subcategories = set()
product_types = set()
for category in categories:
    for subcategory in taxonomy[category].keys():
        subcategories.add(subcategory)
        for product_type in taxonomy[category][subcategory]:
            product_types.add(product_type)
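To see what this flattening produces, the sketch below applies the same idea to an in-memory dict that mirrors a slice of the sample taxonomy, so no yaml file or PyYAML install is needed. The dict literal is an illustrative stand-in for the parsed file, and set comprehensions replace the explicit loops.

```python
# In-memory stand-in for a slice of the parsed taxonomy.yml (illustrative only).
taxonomy = {
    "tops": {
        "t-shirts": ["crew_neck", "v_neck", "graphic_tees"],
        "sweaters": ["crewneck", "cardigan", "pullover"],
    },
    "bottoms": {
        "shorts": ["athletic_shorts", "cargo_shorts"],
    },
    "colors": ["black", "navy", "white", "beige", "brown"],
}

colors = taxonomy["colors"]
# Every top-level key except "colors" is a category.
categories = set(taxonomy) - {"colors"}
# Second-level keys are subcategories.
subcategories = {sub for cat in categories for sub in taxonomy[cat]}
# Leaf list entries are product types.
product_types = {
    product
    for cat in categories
    for items in taxonomy[cat].values()
    for product in items
}

print(sorted(categories))
print(sorted(subcategories))
print(sorted(product_types))
```

Note that the hierarchy is discarded at this point: the three sets record which values exist, not which product type belongs to which subcategory.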

We can then use this information in our response_model to make sure that the metadata we extract is consistent with the taxonomy.

from pydantic import BaseModel, ValidationInfo, model_validator


class PersonalStyle(BaseModel):
    """
    Ideally you map this to a specific taxonomy
    """

    categories: list[str]
    subcategories: list[str]
    product_types: list[str]
    colors: list[str]

    @model_validator(mode="after")
    def validate_options(self, info: ValidationInfo):
        context = info.context
        colors = context["colors"]
        categories = context["categories"]
        subcategories = context["subcategories"]
        product_types = context["product_types"]

        # Validate colors
        for color in self.colors:
            if color not in colors:
                raise ValueError(
                    f"Color {color} is not in the taxonomy. Valid colors are {colors}"
                )
        for category in self.categories:
            if category not in categories:
                raise ValueError(
                    f"Category {category} is not in the taxonomy. Valid categories are {categories}"
                )

        for subcategory in self.subcategories:
            if subcategory not in subcategories:
                raise ValueError(
                    f"Subcategory {subcategory} is not in the taxonomy. Valid subcategories are {subcategories}"
                )

        for product_type in self.product_types:
            if product_type not in product_types:
                raise ValueError(
                    f"Product type {product_type} is not in the taxonomy. Valid product types are {product_types}"
                )

        return self
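The validator can be exercised without any model call, because Pydantic v2 accepts a `context` argument on `model_validate`. The sketch below uses a trimmed-down, hypothetical `ColorChoice` model with a single field so that it stays short; the full `PersonalStyle` model would presumably behave the same way for each of its four lists.

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, model_validator


class ColorChoice(BaseModel):
    """Hypothetical, trimmed-down model used only for this illustration."""

    colors: list[str]

    @model_validator(mode="after")
    def validate_colors(self, info: ValidationInfo):
        # Fall back to an empty list when no context was supplied.
        allowed = (info.context or {}).get("colors", [])
        for color in self.colors:
            if color not in allowed:
                raise ValueError(
                    f"Color {color} is not in the taxonomy. Valid colors are {allowed}"
                )
        return self


context = {"colors": ["black", "navy", "white"]}

# A value from the taxonomy passes validation.
ok = ColorChoice.model_validate({"colors": ["navy"]}, context=context)
print(ok.colors)

# A value outside the taxonomy surfaces as a ValidationError.
error_message = None
try:
    ColorChoice.model_validate({"colors": ["neon_green"]}, context=context)
except ValidationError as exc:
    error_message = exc.errors()[0]["msg"]
    print(error_message)
```

Listing the valid options in the error message is a deliberate choice: when validation fails inside a retry loop, that message is the feedback the model gets for its next attempt.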

Making the API call

Finally, we can put all of this together in a single API call to gpt-4o, passing in the images and supplying PersonalStyle as the response_model parameter.

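Thanks to instructor's built-in support for jinja formatting, the data we expose through the context keyword is rendered into the prompt and is available again during validation, which makes this step straightforward.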

import openai
import instructor

client = instructor.from_openai(openai.OpenAI())

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """
You are a helpful assistant. You are given a list of images and you need to map the personal style of the person in the images to a given taxonomy.

Here is the taxonomy that you should use

Colors:
{% for color in colors %}
* {{ color }}
{% endfor %}

Categories:
{% for category in categories %}
* {{ category }}
{% endfor %}

Subcategories:
{% for subcategory in subcategories %}
* {{ subcategory }}
{% endfor %}

Product types:
{% for product_type in product_types %}
* {{ product_type }}
{% endfor %}
""",
        },
        {
            "role": "user",
            "content": [
                "Here are the images of the person, describe the personal style of the person in the image from a first-person perspective (e.g. You are ...)",
                *images,
            ],
        },
    ],
    response_model=PersonalStyle,
    context={
        "colors": colors,
        "categories": list(categories),
        "subcategories": list(subcategories),
        "product_types": list(product_types),
    },
)
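To get a feel for what the system prompt looks like once the context values are substituted, the sketch below renders a shortened version of the colors section directly with the jinja2 package. This is only an approximation for illustration: instructor performs the rendering itself, and its environment settings may differ from the default `Template` used here.

```python
from jinja2 import Template

# Shortened version of the "Colors" section of the system prompt.
template = Template(
    "Colors:\n"
    "{% for color in colors %}"
    "* {{ color }}\n"
    "{% endfor %}"
)

# The keyword arguments play the role of the `context` dict.
rendered = template.render(colors=["black", "navy"])
print(rendered)
```

Each entry of the list becomes one bullet in the prompt, so the model sees exactly the same set of values that the validator later checks against.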

The API call returns the following response.

PersonalStyle(
    categories=['tops', 'bottoms'],
    subcategories=['sweaters', 'jackets', 'pants'],
    product_types=['cardigan', 'crewneck', 'denim_jackets', 'chinos'],
    colors=['brown', 'beige', 'black', 'white', 'navy']
)
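Once the style has been extracted, matching it against a catalog can be as simple as a filter. The sketch below is one possible approach under stated assumptions: the `Product` dataclass, the `recommend` function, and the catalog entries are all hypothetical, and a real store would likely rank results rather than apply a strict filter.

```python
from dataclasses import dataclass


@dataclass
class Product:
    """Hypothetical catalog entry tagged with taxonomy values."""

    name: str
    product_type: str
    color: str


def recommend(catalog, product_types, colors):
    """Return products whose type and color both appear in the extracted style."""
    wanted_types = set(product_types)
    wanted_colors = set(colors)
    return [
        product
        for product in catalog
        if product.product_type in wanted_types and product.color in wanted_colors
    ]


# Made-up catalog for illustration.
catalog = [
    Product("Shawl cardigan", "cardigan", "brown"),
    Product("Slim chinos", "chinos", "navy"),
    Product("Neon bomber", "bomber_jackets", "green"),
]

matches = recommend(
    catalog,
    product_types=["cardigan", "crewneck", "denim_jackets", "chinos"],
    colors=["brown", "beige", "black", "white", "navy"],
)
print([product.name for product in matches])
```

Because the extracted values were validated against the same taxonomy that tags the catalog, the comparison can use exact string matches with no fuzzy mapping step.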

Looking Ahead

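The ability to extract structured metadata from images opens up exciting possibilities for personalization in e-commerce. The key is to maintain the bridge between unstructured visual inspiration and structured product data through a well-defined taxonomy and robust validation.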

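instructor makes working with multimodal data easy, and we're excited to see what you build with it. Give it a try today with pip install instructor and see how easy it is to work with language models using structured extraction.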