Extracting data from slides

This guide demonstrates how to extract structured data from presentation slides.

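Motivation

When we need to turn the key information in a slide deck into structured data, isolating the text and extracting from that alone may not be enough. Important data sometimes lives in the images on a slide, so the extraction pipeline should take those images into account as well.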

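Defining the necessary data structures

Suppose we want to extract competitors from a range of presentations and group them by industry.

Our data model has three parts: Competitor, a single company with an optional list of features; Industry, which holds the list of Competitor objects for one industry; and Competition, which aggregates the competitors across all industries.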

from pydantic import BaseModel, Field
from typing import Optional, List


# Define models
class Competitor(BaseModel):
    name: str
    # Accepts None when no features are available for a competitor
    features: Optional[List[str]]


class Industry(BaseModel):
    """
    Represents competitors from a specific industry extracted from an image using AI.
    """

    name: str = Field(description="The name of the industry")
    competitor_list: List[Competitor] = Field(
        description="A list of competitors for this industry"
    )


class Competition(BaseModel):
    """
    This class serves as a structured representation of
    competitors and their qualities.
    """

    industry_list: List[Industry] = Field(
        description="A list of industries and their competitors"
    )
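
To make the nesting concrete, here is a minimal sketch that validates a hand-written dictionary against the same three models. The sample payload is invented for illustration and is not real extractor output. The sketch assumes Pydantic v2, where `model_validate` is available. It also gives `features` an explicit `None` default, which is an assumption and not part of the model above: in Pydantic v2 a field annotated `Optional[...]` without a default accepts `None` but must still be present in the input.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Competitor(BaseModel):
    name: str
    # Assumption: an explicit default lets the field be omitted entirely.
    features: Optional[List[str]] = None


class Industry(BaseModel):
    name: str = Field(description="The name of the industry")
    competitor_list: List[Competitor] = Field(
        description="A list of competitors for this industry"
    )


class Competition(BaseModel):
    industry_list: List[Industry] = Field(
        description="A list of industries and their competitors"
    )


# Hypothetical payload, shaped like what the extraction step is expected to return.
sample = {
    "industry_list": [
        {
            "name": "Accommodation Services",
            "competitor_list": [
                {"name": "CouchSurfing", "features": ["Free accommodation"]},
                {"name": "Craigslist"},  # no features listed
            ],
        }
    ]
}

# Nested dictionaries are converted into Industry and Competitor instances.
competition = Competition.model_validate(sample)
first_industry = competition.industry_list[0]
print(first_industry.name)
print([c.name for c in first_industry.competitor_list])
print(first_industry.competitor_list[1].features)
```

If the payload does not match the schema (for example, a competitor without a `name`), `model_validate` raises a `ValidationError` instead of returning a partially filled object.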

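Competitor extraction

To extract competitors from slides, we define a function that takes a list of image URLs, sends the images to the model, and returns the relevant information as a Competition object.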

from typing import List

import instructor
from openai import OpenAI

# Patch the OpenAI client with instructor so that
# chat.completions.create accepts the response_model keyword
client = instructor.from_openai(OpenAI())


# Define functions
def read_images(image_urls: List[str]) -> Competition:
    """
    Given a list of image URLs, identify the competitors in the images.
    """
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Competition,
        max_tokens=2048,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Identify competitors and generate key features for each competitor.",
                    },
                    *[
                        {"type": "image_url", "image_url": {"url": url}}
                        for url in image_urls
                    ],
                ],
            }
        ],
    )
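
The list unpacking inside `messages` can be easier to follow when pulled out on its own. The sketch below uses a hypothetical helper, `build_content`, which is not part of instructor or the OpenAI SDK. It only assembles the same plain dictionaries that `read_images` passes as the message content, so it runs without network access. The URLs are placeholders.

```python
from typing import Any, Dict, List


def build_content(prompt: str, image_urls: List[str]) -> List[Dict[str, Any]]:
    """Return one text part followed by one image_url part per URL."""
    text_part = {"type": "text", "text": prompt}
    image_parts = [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    return [text_part, *image_parts]


content = build_content(
    "Identify competitors and generate key features for each competitor.",
    # Placeholder URLs, for illustration only.
    ["https://example.com/slide-1.png", "https://example.com/slide-2.png"],
)
print(len(content))
print([part["type"] for part in content])
```

Because every slide becomes one extra part in a single user message, the model sees all images together and can return one combined Competition object for the whole deck.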

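Execution

Finally, we run the function above on a sample slide to see the data extractor in action.

As the output shows, the model extracts the relevant information for each competitor regardless of how it is formatted in the original presentation.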

url = [
    'https://miro.medium.com/v2/resize:fit:1276/0*h1Rsv-fZWzQUyOkt',
]
model = read_images(url)
print(model.model_dump_json(indent=2))
"""
{
  "industry_list": [
    {
      "name": "Accommodation Services",
      "competitor_list": [
        {
          "name": "CouchSurfing",
          "features": [
            "Free accommodation",
            "Cultural exchange",
            "Community-driven",
            "User profiles and reviews"
          ]
        },
        {
          "name": "Craigslist",
          "features": [
            "Local listings",
            "Variety of accommodation types",
            "Direct communication with hosts",
            "No booking fees"
          ]
        },
        {
          "name": "BedandBreakfast.com",
          "features": [
            "Specialized in B&Bs",
            "User reviews",
            "Booking options",
            "Local experiences"
          ]
        },
        {
          "name": "AirBed & Breakfast (Airbnb)",
          "features": [
            "Wide range of accommodations",
            "User reviews",
            "Instant booking",
            "Host profiles"
          ]
        },
        {
          "name": "Hostels.com",
          "features": [
            "Budget-friendly hostels",
            "User reviews",
            "Booking options",
            "Global reach"
          ]
        },
        {
          "name": "RentDigs.com",
          "features": [
            "Rental listings",
            "User-friendly interface",
            "Local listings",
            "Direct communication with landlords"
          ]
        },
        {
          "name": "VRBO",
          "features": [
            "Vacation rentals",
            "Family-friendly options",
            "User reviews",
            "Booking protection"
          ]
        },
        {
          "name": "Hotels.com",
          "features": [
            "Wide range of hotels",
            "Rewards program",
            "User reviews",
            "Price match guarantee"
          ]
        }
      ]
    }
  ]
}
"""