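Instructor is often used to generate synthetic data rather than to extract data. We can even use the extra fields of JSON Schema to supply specific examples that control how the data is generated.
Consider the example below. With nothing but field names and types to go on, the model will most likely generate only very simple names.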
from typing import Iterable

from pydantic import BaseModel
import instructor
from openai import OpenAI


# Define the UserDetail model
class UserDetail(BaseModel):
    name: str
    age: int


# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())


def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate {count} synthetic users"},
        ],
    )


for user in generate_fake_users(5):
    print(user)
    #> name='Alice' age=25
    #> name='Bob' age=30
    #> name='Charlie' age=35
    #> name='David' age=40
    #> name='Eve' age=22
We may want to make examples part of the prompt by leveraging Pydantic's configuration. Examples can be set directly in the JSON Schema itself.
from typing import Iterable

from pydantic import BaseModel, Field
import instructor
from openai import OpenAI


# Define the UserDetail model
class UserDetail(BaseModel):
    # Field-level examples are emitted in the JSON Schema for this property
    name: str = Field(examples=["Timothee Chalamet", "Zendaya"])
    age: int


# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())


def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate {count} synthetic users"},
        ],
    )


for user in generate_fake_users(5):
    print(user)
    #> name='John Doe' age=25
    #> name='Jane Smith' age=30
    #> name='Michael Johnson' age=22
    #> name='Emily Davis' age=28
    #> name='David Brown' age=35
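With celebrity names as examples, the generated names move away from the simple single-word names seen earlier and toward full first-and-last names. In the run shown here the results are generic full names rather than well-known people, so the examples steer the style of the output more than its exact content.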
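To see what the field-level examples actually change, it can help to inspect the schema that Pydantic produces. The sketch below assumes Pydantic v2 (where model_json_schema() is available) and needs no API key, since it never calls the model. It only shows what ends up in the schema; how Instructor forwards that schema to the model is not covered here.

```python
import json

from pydantic import BaseModel, Field


class UserDetail(BaseModel):
    name: str = Field(examples=["Timothee Chalamet", "Zendaya"])
    age: int


# model_json_schema() returns the JSON Schema for the model as a plain dict.
schema = UserDetail.model_json_schema()

# The examples are attached to the "name" property only.
name_schema = schema["properties"]["name"]
age_schema = schema["properties"]["age"]

print(json.dumps(name_schema, indent=2))
print("age has examples:", "examples" in age_schema)
```

Only the name property carries examples, which is consistent with the ages in the output above staying in the same range as before.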
To generate more nuanced synthetic examples, let's upgrade to the "gpt-4-turbo-preview" model and use model-level examples instead of attribute-level examples:
import instructor
from typing import Iterable

from pydantic import BaseModel, ConfigDict
from openai import OpenAI


# Define the UserDetail model
class UserDetail(BaseModel):
    """Old Wizards"""

    name: str
    age: int

    # Model-level examples are added to the top level of the JSON Schema
    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {"name": "Gandalf the Grey", "age": 1000},
                {"name": "Albus Dumbledore", "age": 150},
            ]
        }
    )


# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())


def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate `{count}` synthetic examples"},
        ],
    )


for user in generate_fake_users(5):
    print(user)
    #> name='Merlin' age=1000
    #> name='Saruman the White' age=700
    #> name='Radagast the Brown' age=600
    #> name='Elminster Aumar' age=1200
    #> name='Mordenkainen' age=850
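Model-level examples passed through json_schema_extra are copied into the schema as-is, and Pydantic does not check them against the model. A minimal sketch, assuming Pydantic v2, that inspects the schema and validates each example by hand:

```python
from pydantic import BaseModel, ConfigDict


class UserDetail(BaseModel):
    """Old Wizards"""

    name: str
    age: int

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {"name": "Gandalf the Grey", "age": 1000},
                {"name": "Albus Dumbledore", "age": 150},
            ]
        }
    )


schema = UserDetail.model_json_schema()

# The class docstring becomes the schema's top-level description.
print(schema["description"])

# json_schema_extra is merged without validation, so check each example
# against the model; a malformed example raises a ValidationError here.
validated = [UserDetail.model_validate(example) for example in schema["examples"]]
for user in validated:
    print(user)
```

Both the docstring and the examples sit at the top level of the schema, so together they describe the whole record (old wizards with very large ages) rather than a single field.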
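By adjusting the descriptions in the Pydantic model, we can subtly influence the nature of the generated synthetic data. This gives finer control over the output and keeps the generated data closer to our expectations or requirements.
For example, specifying "Fancy French sounding names" as the description of the name field in the UserDetail model steers generation toward names that match that criterion, producing a dataset that is both varied and consistent with a particular linguistic character.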
import instructor
from typing import Iterable

from pydantic import BaseModel, Field
from openai import OpenAI


# Define the UserDetail model
class UserDetail(BaseModel):
    # The description is emitted in the JSON Schema for this property
    name: str = Field(description="Fancy French sounding names")
    age: int


# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())


def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate `{count}` synthetic users"},
        ],
    )


for user in generate_fake_users(5):
    print(user)
    #> name='Jean Luc' age=30
    #> name='Claire Belle' age=25
    #> name='Pierre Leclair' age=40
    #> name='Amelie Rousseau' age=35
    #> name='Etienne Lefevre' age=28
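A quick way to confirm where the description lives is to read it back from the model, both from the field metadata and from the generated schema. This is a small offline sketch assuming Pydantic v2; it does not call the OpenAI API.

```python
from pydantic import BaseModel, Field


class UserDetail(BaseModel):
    name: str = Field(description="Fancy French sounding names")
    age: int


# The description is stored on the field's metadata...
field_description = UserDetail.model_fields["name"].description
print(field_description)

# ...and is emitted under the matching property in the JSON Schema.
name_schema = UserDetail.model_json_schema()["properties"]["name"]
print(name_schema["description"])
```

Because the description is free text, it can express stylistic constraints that a list of examples cannot, which is why it suits instructions like the one used above.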