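Structured Outputs with Multimodal Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyze travel videos and extract structured recommendations. This powerful combination lets us process multimodal input (video) and generate structured output using Pydantic models. This post was written in collaboration with Kino.ai, a company that uses instructor to do structured extraction from multimodal inputs to improve search for filmmakers.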

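Setting Up the Environment

First, let's set up the environment and install the necessary libraries. The code in this post uses the instructor, google-generativeai, and pydantic packages.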


Defining Our Data Models

We'll use Pydantic to define data models for tourist destinations and recommendations:

from pydantic import BaseModel


class TouristDestination(BaseModel):
    name: str
    description: str
    location: str


class Recommendations(BaseModel):
    chain_of_thought: str
    description: str
    destinations: list[TouristDestination]
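
To see what these classes ask the model to fill in, it can help to print the JSON schema that Pydantic derives from them. The snippet below is a minimal sketch that repeats the two models so it runs on its own; it assumes Pydantic v2, where `model_json_schema()` is available.

```python
from pydantic import BaseModel


class TouristDestination(BaseModel):
    name: str
    description: str
    location: str


class Recommendations(BaseModel):
    chain_of_thought: str
    description: str
    destinations: list[TouristDestination]


# Build the JSON schema for the top-level model (Pydantic v2 API).
schema = Recommendations.model_json_schema()

# Top-level fields the model is expected to return.
print(sorted(schema["properties"]))

# Nested models are listed separately under "$defs".
print(sorted(schema["$defs"]))
```

Every field here is required, so a response that leaves one out should fail validation rather than pass through silently.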

Initializing the Gemini Client

Next, we'll set up the Gemini client using Instructor:

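import google.generativeai as genai
import instructor

# Assumes a Gemini API key is already configured, e.g. via genai.configure(api_key=...).
client = instructor.from_gemini(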
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    ),
)

Uploading and Processing the Video

To analyze a video, we first need to upload it:

file = genai.upload_file("./takayama.mp4")
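
Uploaded videos may need some processing time on the server before they can be used in a prompt. The helper below is a minimal sketch of a polling loop for that wait. `wait_until_ready` is a hypothetical helper, not part of Instructor or the Google SDK, and the state names "PROCESSING" and "ACTIVE" as well as the `file.state.name` and `genai.get_file` wiring in the trailing comment are assumptions about the google-generativeai SDK that should be checked against its documentation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until_ready(
    item: T,
    get_state: Callable[[T], str],
    refresh: Callable[[T], T],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Poll until the item leaves the PROCESSING state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while get_state(item) == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError("upload was still processing at the timeout")
        sleep(poll_seconds)
        item = refresh(item)  # fetch a fresh copy with the current state
    if get_state(item) != "ACTIVE":
        raise RuntimeError(f"upload ended in state {get_state(item)}")
    return item


# Possible wiring for the uploaded video (an assumption about the SDK):
# file = wait_until_ready(
#     file,
#     get_state=lambda f: f.state.name,
#     refresh=lambda f: genai.get_file(f.name),
# )
```

Passing the state lookup and refresh step in as functions keeps the loop independent of the SDK, so it can be exercised with simple fakes before it is pointed at a real upload.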

Then, we can process the video and extract recommendations:

resp = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": ["What places do they recommend in this video?", file],
        }
    ],
    response_model=Recommendations,
)

print(resp)

Raw results:

Recommendations(
    chain_of_thought='The video recommends visiting Takayama city, in the Hida Region, Gifu Prefecture. The
video suggests visiting the Miyagawa Morning Market, to try the Sarubobo good luck charms, and to enjoy the
cookie cup espresso, made by Koma Coffee. Then, the video suggests visiting a traditional Japanese Cafe,
called Kissako Katsure, and try their matcha and sweets. Afterwards, the video suggests to visit the Sanmachi
Historic District, where you can find local crafts and delicious foods. The video recommends trying Hida Wagyu
beef, at the Kin no Kotte Ushi shop, or to have a sit-down meal at the Kitchen Hida. Finally, the video
recommends visiting Shirakawa-go, a World Heritage Site in Gifu Prefecture.',
    description='This video recommends a number of places to visit in Takayama city, in the Hida Region, Gifu
Prefecture. It shows some of the local street food and highlights some of the unique shops and restaurants in
the area.',
    destinations=[
        TouristDestination(
            name='Takayama',
            description='Takayama is a city at the base of the Japan Alps, located in the Hida Region of
Gifu.',
            location='Hida Region, Gifu Prefecture'
        ),
        TouristDestination(
            name='Miyagawa Morning Market',
            description="The Miyagawa Morning Market, or the Miyagawa Asai-chi in Japanese, is a market that
has existed officially since the Edo Period, more than 100 years ago. It's open every single day, rain or
shine, from 7am to noon.",
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Nakaya - Handmade Hida Sarubobo',
            description='The Nakaya shop sells handcrafted Sarubobo good luck charms.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Koma Coffee',
            description="Koma Coffee is a shop that has been in business for about 50 or 60 years, and they
serve coffee in a cookie cup. They've been serving coffee for about 10 years.",
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Kissako Katsure',
            description='Kissako Katsure is a traditional Japanese style cafe, called Kissako, and the name
means would you like to have some tea. They have a variety of teas and sweets.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Sanmachi Historic District',
            description='Sanmachi Dori is a Historic Merchant District in Takayama, all of the buildings here
have been preserved to look as they did in the Edo Period.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Suwa Orchard',
            description='The Suwa Orchard has been in business for more than 50 years.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Kitchen HIDA',
            description='Kitchen HIDA is a restaurant with a 50 year history, known for their Hida Beef dishes
and for using a lot of local ingredients.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Kin no Kotte Ushi',
            description='Kin no Kotte Ushi is a shop known for selling Beef Sushi, especially Hida Wagyu Beef
Sushi. Their sushi is medium rare.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Shirakawa-go',
            description='Shirakawa-go is a World Heritage Site in Gifu Prefecture.',
            location='Gifu Prefecture'
        )
    ]
)

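The Gemini model analyzes the video and returns structured recommendations. Here is a summary of the extracted information:

  1. Takayama City: The main destination, located in the Hida Region of Gifu Prefecture.
  2. Miyagawa Morning Market: A long-standing market open every day from 7am to noon.
  3. Nakaya Shop: Sells handcrafted Sarubobo good luck charms.
  4. Koma Coffee: A shop with a 50-60 year history, known for serving coffee in a cookie cup.
  5. Kissako Katsure: A traditional Japanese-style cafe offering a variety of teas and sweets.
  6. Sanmachi Historic District: A well-preserved merchant district from the Edo Period.
  7. Suwa Orchard: An orchard business with more than 50 years of history.
  8. Kitchen HIDA: A restaurant with a 50-year history, known for its Hida beef dishes.
  9. Kin no Kotte Ushi: A shop specializing in Hida Wagyu beef sushi.
  10. Shirakawa-go: A World Heritage Site in Gifu Prefecture.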
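
A summary like this does not have to be written by hand. Because `resp.destinations` is a plain list of objects with `name`, `description`, and `location` attributes, a small helper can render it. The sketch below is one way to do that; `format_summary` is a hypothetical helper, not part of Instructor, and the sample uses stand-in objects copied from two entries of the raw result so that it runs without calling the API.

```python
from types import SimpleNamespace
from typing import Iterable


def format_summary(destinations: Iterable) -> str:
    """Render one numbered line per destination: name, location, description."""
    lines = []
    for number, place in enumerate(destinations, start=1):
        lines.append(
            f"{number}. {place.name} ({place.location}): {place.description}"
        )
    return "\n".join(lines)


# Stand-ins with the same attributes as TouristDestination,
# copied from two entries of the raw result.
sample = [
    SimpleNamespace(
        name="Suwa Orchard",
        description="The Suwa Orchard has been in business for more than 50 years.",
        location="Hida Takayama",
    ),
    SimpleNamespace(
        name="Shirakawa-go",
        description="Shirakawa-go is a World Heritage Site in Gifu Prefecture.",
        location="Gifu Prefecture",
    ),
]

print(format_summary(sample))
```

With a real response, the call would be `format_summary(resp.destinations)`.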

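Limitations, Challenges, and Future Directions

While the current approach demonstrates the power of multimodal AI for video analysis, there are several limitations and challenges to consider:

  1. Lack of temporal information: Our current approach extracts overall recommendations but does not provide timestamps for specific mentions. This limits the ability to link recommendations to exact moments in the video.

  2. Speaker diarization: The model does not distinguish between different speakers in the video. Implementing speaker diarization could provide valuable context about who is making a specific recommendation.

  3. Content density: Longer or more complex videos might overwhelm the model, potentially leading to missed information or inaccurate extractions.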

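Future Explorations

To address these limitations and extend the capabilities of our video analysis system, here are some promising areas to explore: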

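  1. Timestamp extraction: Enhance the model to provide a timestamp for each recommendation or point of interest mentioned in the video. This could be achieved with a schema such as:

from typing import Literal

from pydantic import BaseModel


class TimestampedRecommendation(BaseModel):
    timestamp: str
    timestamp_format: Literal["HH:MM", "HH:MM:SS"]  # Helps with parsing
    recommendation: str


# TouristDestination is the model defined earlier in this post.
class EnhancedRecommendations(BaseModel):
    destinations: list[TouristDestination]
    timestamped_mentions: list[TimestampedRecommendation]

  2. Speaker diarization: Implement speaker recognition to attribute recommendations to specific individuals. This could be particularly useful for videos featuring multiple hosts or interviewees.

  3. Segment-based analysis: Process longer videos in segments to maintain accuracy and capture all relevant information. This approach could involve:

     - Splitting the video into smaller chunks
     - Analyzing each chunk separately
     - Aggregating and deduplicating results

  4. Multi-language support: Extend the model's capabilities to accurately analyze videos in various languages and capture culturally specific recommendations.

  5. Visual element analysis: Enhance the model to recognize and describe visual elements shown in the video, such as landmarks, food dishes, or activities, even when they are not explicitly mentioned in the audio.

  6. Sentiment analysis: Incorporate sentiment analysis to gauge the speaker's enthusiasm or reservations about specific recommendations.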
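
Two of the ideas above, timestamp extraction and segment-based analysis, meet when results from separate chunks are combined: timestamps reported for a chunk have to be shifted by the chunk's start time, and a place mentioned in two chunks should appear only once. The sketch below is one possible post-processing step, not a tested pipeline. `to_seconds`, `Mention`, and `merge_segments` are hypothetical names. It assumes timestamps are relative to the start of each chunk, reads the format labels literally (so "HH:MM" means hours and minutes), treats names that differ only in case or spacing as duplicates, and uses made-up timestamps purely for illustration.

```python
from dataclasses import dataclass


def to_seconds(timestamp: str, timestamp_format: str) -> int:
    """Convert an "HH:MM" or "HH:MM:SS" string to a number of seconds."""
    parts = [int(part) for part in timestamp.split(":")]
    if timestamp_format == "HH:MM" and len(parts) == 2:
        hours, minutes = parts
        seconds = 0
    elif timestamp_format == "HH:MM:SS" and len(parts) == 3:
        hours, minutes, seconds = parts
    else:
        raise ValueError(f"{timestamp!r} does not match {timestamp_format!r}")
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class Mention:
    """Hypothetical stand-in for one mention extracted from a chunk."""

    name: str
    timestamp: str
    timestamp_format: str


def merge_segments(
    segments: list[tuple[int, list[Mention]]],
) -> list[tuple[str, int]]:
    """Merge per-chunk mentions into (name, earliest absolute second) pairs.

    Each segment is (chunk_start_in_seconds, mentions_found_in_that_chunk).
    """
    earliest: dict[str, tuple[str, int]] = {}
    for offset, mentions in segments:
        for mention in mentions:
            # Normalize case and whitespace so near-identical names collapse.
            key = " ".join(mention.name.lower().split())
            absolute = offset + to_seconds(
                mention.timestamp, mention.timestamp_format
            )
            if key not in earliest or absolute < earliest[key][1]:
                earliest[key] = (mention.name, absolute)
    return sorted(earliest.values(), key=lambda pair: pair[1])


# Made-up input: two chunks, the second starting 600 seconds into the video.
segments = [
    (0, [Mention("Miyagawa Morning Market", "00:01:30", "HH:MM:SS")]),
    (
        600,
        [
            Mention("miyagawa  morning market", "00:00:20", "HH:MM:SS"),
            Mention("Koma Coffee", "00:02:00", "HH:MM:SS"),
        ],
    ),
]

for name, second in merge_segments(segments):
    print(f"{second:>5}s  {name}")
```

Keeping only the earliest mention is a deliberate simplification; a real system might keep every occurrence, or use fuzzy matching to catch spelling variants that this exact-match key would miss.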

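By addressing these challenges and exploring these new directions, we can build a more comprehensive and nuanced video analysis system, opening up further possibilities for applications in travel, education, and beyond.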