
Working with Multimodal Data Using Gemini

This tutorial shows how to use instructor together with google-generativeai to work with multimodal data. In this example, we demonstrate three ways to process an audio file.

We will use this recording, which is taken from the Google Generative AI guide.

Normal Message

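The first way to process an audio file is to upload the whole file and pass it to the LLM as a normal message. This is the easiest way to get started and requires no special setup.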


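  1. Make sure to set the mode to GEMINI_JSON. This is important because tool calling does not work with multimodal inputs.
  2. Upload the file with genai.upload_file. If the file has already been uploaded, you can retrieve it with genai.get_file.
  3. Pass the file object as the content of an ordinary user message.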
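
The three steps above can be sketched as follows. This is a minimal sketch rather than the tutorial's original listing: the helper names `build_messages` and `describe_audio`, the `Description` model, and the prompt text are illustrative assumptions. Running `describe_audio` requires the instructor and google-generativeai packages and a configured API key. The SDK imports sit inside the function so that the message-building helper can be inspected without them.

```python
from typing import Any


def build_messages(prompt: str, file_obj: Any) -> list[dict[str, Any]]:
    """Pair a text prompt with an uploaded file as two ordinary user messages."""
    return [
        {"role": "user", "content": prompt},
        {"role": "user", "content": file_obj},  # step 3: file object as a normal message
    ]


def describe_audio(path: str):
    """Upload an audio file and ask Gemini for a structured description."""
    # Imported lazily so build_messages stays usable without the SDKs installed.
    import google.generativeai as genai
    import instructor
    from pydantic import BaseModel

    class Description(BaseModel):  # hypothetical response model
        description: str

    client = instructor.from_gemini(
        client=genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest"),
        mode=instructor.Mode.GEMINI_JSON,  # step 1: JSON mode, not tool calling
    )
    audio_file = genai.upload_file(path)  # step 2: upload through the File API
    return client.create(
        response_model=Description,
        messages=build_messages(
            "Summarize what's happening in this audio file", audio_file
        ),
    )
```

Calling `describe_audio("./sample.mp3")` would then return a `Description` instance, assuming the upload and the model call both succeed.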

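Inline Audio Segment

Maximum File Size

When uploading and processing audio, there is a limit on the size of a file that can be sent to the API as an inline segment. You will know you have hit this limit when the following error is raised.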

google.api_core.exceptions.InvalidArgument: 400 Request payload size exceeds the limit: 20971520 bytes. Please upload your files with the File API instead.`f = genai.upload_file(path); m.generate_content(['tell me about this file:', f])`

For video files, we recommend uploading through the File API with genai.upload_file, as described in the Normal Message section above.
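
One way to avoid this error is to check the size of the audio before choosing between the inline path and the File API. The sketch below is an illustration under assumptions: `INLINE_LIMIT_BYTES` is taken from the error message above, while `HEADROOM_BYTES` and the function name `should_use_file_api` are hypothetical. The limit applies to the whole request payload rather than to the audio bytes alone, so the headroom value is a guess, not a documented figure.

```python
INLINE_LIMIT_BYTES = 20_971_520  # 20 MiB, as reported in the error message
HEADROOM_BYTES = 1_048_576  # hypothetical 1 MiB allowance for prompt and request overhead


def should_use_file_api(audio_size_bytes: int) -> bool:
    """Return True when the audio is too large to send safely as an inline part."""
    if audio_size_bytes < 0:
        raise ValueError("audio_size_bytes must be non-negative")
    return audio_size_bytes > INLINE_LIMIT_BYTES - HEADROOM_BYTES
```

For a file on disk the size can be read with `os.path.getsize(path)`; for bytes already in memory, `len(data)` gives the same number.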

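The second way is to pass an audio segment inline, as the content of a normal message, as shown below. This requires the pydub library (pip install pydub).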

import instructor
import google.generativeai as genai
from pydantic import BaseModel
from pydub import AudioSegment

client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    ),
    mode=instructor.Mode.GEMINI_JSON,  # (1)!
)


sound = AudioSegment.from_mp3("sample.mp3")  # (2)!
sound = sound[:60000]  # keep the first 60 seconds (pydub slices in milliseconds)


class Transcription(BaseModel):
    summary: str
    exact_transcription: str


resp = client.create(
    response_model=Transcription,
    messages=[
        {
            "role": "user",
            "content": "Please transcribe this recording",
        },
        {
            "role": "user",
            "content": {
                "mime_type": "audio/mp3",
                # export() returns a file-like object; read() yields the raw MP3 bytes
                "data": sound.export(format="mp3").read(),  # (3)!
            },
        },
    ],
)

print(resp)
"""
summary='President addresses the joint session of Congress,  reflecting on his first time taking the oath of federal office and the knowledge and inspiration gained.' exact_transcription="The President's state of the union address to a joint session of the Congress from the rostrum of the House of Representatives, Washington D.C. January 30th 1961 Speaker, Mr Vice President members of the Congress It is a pleasure to return from whence I came You are among my oldest friends in Washington And this house is my oldest home It was here it was here more than 14 years ago that I first took the oath of federal office It was here for 14 years that I gained both knowledge and inspiration from members of both"
"""

# Sample output from a separate run; the model's wording varies between calls.
#> summary='President delivers a speech to a joint session of Congress,
#> highlighting his history in the House of Representatives and thanking
#> the members of Congress for their guidance.',
#>
#> exact_transcription="The President's State of the Union address to a
#> joint session of the Congress from the rostrum of the House of
#> Representatives, Washington DC, January 30th 1961. Mr. Speaker, Mr.
#> Vice-President, members of the Congress, it is a pleasure to return
#> from whence I came. You are among my oldest friends in Washington,
#> and this house is my oldest home. It was here that I first took the
#> oath of federal office. It was here for 14 years that I gained both
#> knowledge and inspiration from members of both"
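  1. Make sure to set the mode to GEMINI_JSON. This is important because tool calling does not work with multimodal inputs.
  2. Load your audio file with AudioSegment.from_mp3.
  3. Pass the content as a dictionary containing the correct mime_type and the audio bytes in the data field.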

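Lists of Content

Following the google-generativeai documentation, we also support passing these items as a single list. Here is how to do that with audio from the same recording.

Note that the list can mix plain text prompts and file objects, which makes this form quite flexible.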

import instructor
import google.generativeai as genai
from pydantic import BaseModel


client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    ),
    mode=instructor.Mode.GEMINI_JSON,  # (1)!
)

mp3_file = genai.upload_file("./sample.mp3")  # (2)!


class Description(BaseModel):
    description: str


content = [
    "Summarize what's happening in this audio file and who the main speaker is",
    mp3_file,  # (3)!
]

resp = client.create(
    response_model=Description,
    messages=[
        {
            "role": "user",
            "content": content,
        }
    ],
)

print(resp)
"""
description='President John F. Kennedy delivers his State of the Union address to the Congress on January 30, 1961. The speech was delivered at the rostrum of the House of Representatives in Washington, D.C.'
"""
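  1. Make sure to set the mode to GEMINI_JSON. This is important because tool calling does not work with multimodal inputs.
  2. Upload the file with genai.upload_file, or retrieve an already uploaded file with genai.get_file.
  3. Pass the content as a list containing both the plain text prompt and the file object.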
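
Annotation 2 mentions reusing a file that has already been uploaded. A hedged sketch of that pattern follows. The helper `get_or_upload` and its rule of matching by `display_name` are assumptions made for illustration and are not part of either library. The lister and uploader are passed in as parameters, so the caller decides which functions supply them.

```python
from pathlib import Path
from typing import Any, Callable, Iterable


def get_or_upload(
    path: str,
    list_files: Callable[[], Iterable[Any]],
    upload_file: Callable[..., Any],
) -> Any:
    """Return an already uploaded file whose display name matches, else upload."""
    display_name = Path(path).name
    for existing in list_files():
        # Assumes each listed file exposes a `display_name` attribute.
        if getattr(existing, "display_name", None) == display_name:
            return existing
    # No match found: upload and tag the file so later calls can find it.
    return upload_file(path, display_name=display_name)
```

With google-generativeai this might be called as `get_or_upload("./sample.mp3", genai.list_files, genai.upload_file)`. Files are matched by name only, so a changed file with the same name would not be uploaded again.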