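Extracting Data from Slides
This guide demonstrates how to extract data from slides.
Motivation
When we need to convert key information from slides into structured data, isolating the text and extracting from it alone may not be enough. Sometimes important data sits in images on the slide, so the extraction pipeline should take those images into account.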
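Defining the necessary data structures
Suppose we want to extract competitors from various presentations and categorize them by their respective industries.
Our data model will contain Industry, which is a list of Competitor objects for a specific industry, and Competition, which aggregates the competitors across all industries.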
from pydantic import BaseModel, Field
from typing import Optional, List


class Competitor(BaseModel):
    name: str
    features: Optional[List[str]]


# Define models
class Industry(BaseModel):
    """
    Represents competitors from a specific industry extracted from an image using AI.
    """

    name: str = Field(description="The name of the industry")
    competitor_list: List[Competitor] = Field(
        description="A list of competitors for this industry"
    )


class Competition(BaseModel):
    """
    This class serves as a structured representation of
    competitors and their qualities.
    """

    industry_list: List[Industry] = Field(
        description="A list of industries and their competitors"
    )
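To make the nesting concrete, the following minimal sketch validates a plain dictionary into a Competition. The models are repeated so the snippet runs on its own; the industry and competitor values are invented for illustration, and the behavior shown assumes Pydantic v2.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class Competitor(BaseModel):
    name: str
    features: Optional[List[str]]


class Industry(BaseModel):
    name: str = Field(description="The name of the industry")
    competitor_list: List[Competitor] = Field(
        description="A list of competitors for this industry"
    )


class Competition(BaseModel):
    industry_list: List[Industry] = Field(
        description="A list of industries and their competitors"
    )


# Hypothetical sample data, not taken from any real slide.
raw = {
    "industry_list": [
        {
            "name": "Example Industry",
            "competitor_list": [
                {"name": "Example Co", "features": ["Feature A", "Feature B"]},
                {"name": "Sample Inc", "features": None},
            ],
        }
    ]
}

competition = Competition.model_validate(raw)
first = competition.industry_list[0]
print(first.name, [c.name for c in first.competitor_list])

# Under Pydantic v2, `features` accepts None but has no default,
# so the key itself must still be present in the input.
try:
    Competitor.model_validate({"name": "Missing Features Ltd"})
except ValidationError as exc:
    print("validation failed:", exc.error_count(), "error(s)")
```

If competitors without any listed features should be accepted even when the key is absent, giving features a default of None is one possible adjustment.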
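Competitor extraction
To extract competitors from slides, we will define a function that reads images from their URLs and extracts the relevant information from them.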
import instructor
from openai import OpenAI

# Apply the patch to the OpenAI client
# enables response_model keyword
client = instructor.from_openai(OpenAI())


# Define functions
def read_images(image_urls: List[str]) -> Competition:
    """
    Given a list of image URLs, identify the competitors in the images.
    """
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Competition,
        max_tokens=2048,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Identify competitors and generate key features for each competitor.",
                    },
                    *[
                        {"type": "image_url", "image_url": {"url": url}}
                        for url in image_urls
                    ],
                ],
            }
        ],
    )
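The function above expects image URLs that the API can reach. When slides exist only as local files, one option is to inline each image as a base64 data URL, which the OpenAI chat message format accepts in the same image_url field. The helper name image_file_to_data_url is hypothetical and is not part of Instructor or the OpenAI SDK; this is a sketch that assumes the slides have already been exported as image files such as PNG or JPEG.

```python
import base64
import mimetypes
from pathlib import Path


def image_file_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. "image/png".
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned strings could then be passed in place of remote URLs, for example read_images([image_file_to_data_url("slide_01.png")]), where slide_01.png is a placeholder file name.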
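Execution
Finally, we will run the function defined above on a sample slide to see the data extractor in action.
As the output shows, the model extracts the relevant information for each competitor regardless of how that information was formatted in the original presentation.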
url = [
    'https://miro.medium.com/v2/resize:fit:1276/0*h1Rsv-fZWzQUyOkt',
]
model = read_images(url)
print(model.model_dump_json(indent=2))
"""
{
  "industry_list": [
    {
      "name": "Accommodation Services",
      "competitor_list": [
        {
          "name": "CouchSurfing",
          "features": [
            "Free accommodation",
            "Cultural exchange",
            "Community-driven",
            "User profiles and reviews"
          ]
        },
        {
          "name": "Craigslist",
          "features": [
            "Local listings",
            "Variety of accommodation types",
            "Direct communication with hosts",
            "No booking fees"
          ]
        },
        {
          "name": "BedandBreakfast.com",
          "features": [
            "Specialized in B&Bs",
            "User reviews",
            "Booking options",
            "Local experiences"
          ]
        },
        {
          "name": "AirBed & Breakfast (Airbnb)",
          "features": [
            "Wide range of accommodations",
            "User reviews",
            "Instant booking",
            "Host profiles"
          ]
        },
        {
          "name": "Hostels.com",
          "features": [
            "Budget-friendly hostels",
            "User reviews",
            "Booking options",
            "Global reach"
          ]
        },
        {
          "name": "RentDigs.com",
          "features": [
            "Rental listings",
            "User-friendly interface",
            "Local listings",
            "Direct communication with landlords"
          ]
        },
        {
          "name": "VRBO",
          "features": [
            "Vacation rentals",
            "Family-friendly options",
            "User reviews",
            "Booking protection"
          ]
        },
        {
          "name": "Hotels.com",
          "features": [
            "Wide range of hotels",
            "Rewards program",
            "User reviews",
            "Price match guarantee"
          ]
        }
      ]
    }
  ]
}
"""
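Once the result is serialized, it can be flattened for a spreadsheet or a database table. The sketch below uses only the standard library and operates on JSON in the shape produced by model_dump_json. The function name flatten_competition and the shortened sample payload are illustrative assumptions; one competitor's features are set to null here purely to exercise the fallback.

```python
import csv
import io
import json
from typing import Dict, List


def flatten_competition(payload: str) -> List[Dict[str, str]]:
    """Turn the nested JSON into one row per (industry, competitor) pair."""
    data = json.loads(payload)
    rows = []
    for industry in data["industry_list"]:
        for competitor in industry["competitor_list"]:
            # `features` may be null, so fall back to an empty list.
            features = competitor.get("features") or []
            rows.append(
                {
                    "industry": industry["name"],
                    "competitor": competitor["name"],
                    "features": "; ".join(features),
                }
            )
    return rows


# Shortened, partly made-up sample in the same shape as the extractor output.
sample = json.dumps(
    {
        "industry_list": [
            {
                "name": "Accommodation Services",
                "competitor_list": [
                    {
                        "name": "CouchSurfing",
                        "features": ["Free accommodation", "Cultural exchange"],
                    },
                    {"name": "VRBO", "features": None},
                ],
            }
        ]
    }
)

rows = flatten_competition(sample)

# Write the rows as CSV into an in-memory buffer and print the result.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["industry", "competitor", "features"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Joining the features into a single cell keeps one row per competitor; emitting one row per feature would be an equally reasonable layout if the data is headed for a relational table.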