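Simple Synthetic Data Generation
One thing people use Instructor for is generating synthetic data rather than extracting data. We can even use the extra fields of the JSON Schema to provide concrete examples that control how the data is generated.
Consider the example below. It will most likely produce very simple names.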
from typing import Iterable
from pydantic import BaseModel
import instructor
from openai import OpenAI

# Define the UserDetail model
class UserDetail(BaseModel):
    name: str
    age: int

# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())

def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate {count} synthetic users"},
        ],
    )

for user in generate_fake_users(5):
    print(user)
    #> name='Alice' age=25
    #> name='Bob' age=30
    #> name='Charlie' age=35
    #> name='David' age=40
    #> name='Eve' age=22
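One way to see why the names come out so plain is to look at the schema Pydantic builds for this model. The sketch below assumes Pydantic v2 (`model_json_schema()`) and assumes that Instructor derives the schema it sends to the model from this output. It makes no API call.

```python
import json

from pydantic import BaseModel


class UserDetail(BaseModel):
    name: str
    age: int


# Build the JSON schema Pydantic generates for the bare model.
schema = UserDetail.model_json_schema()
name_schema = schema["properties"]["name"]

# The field carries a type and a title, but no examples or description.
print(json.dumps(name_schema, indent=2))
```

With nothing but a field name and a type to go on, the model has little reason to produce anything other than short, generic names.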
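Leveraging Simple Examples
We might want to include examples as part of the prompt by leveraging Pydantic's configuration. We can set the examples directly in the JSON schema itself.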
from typing import Iterable
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

# Define the UserDetail model
class UserDetail(BaseModel):
    name: str = Field(examples=["Timothee Chalamet", "Zendaya"])
    age: int

# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())

def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate {count} synthetic users"},
        ],
    )

for user in generate_fake_users(5):
    print(user)
    #> name='John Doe' age=25
    #> name='Jane Smith' age=30
    #> name='Michael Johnson' age=22
    #> name='Emily Davis' age=28
    #> name='David Brown' age=35
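By providing celebrity names as examples, we have steered the synthetic data toward the style of well-known personalities' names and away from the simple, single-word names produced earlier. In the run shown above, the result is full first-and-last names rather than the celebrities themselves.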
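To check how the field-level examples reach the model, we can inspect the generated schema locally. This is a minimal sketch assuming Pydantic v2, where `Field(examples=...)` is written into the property's JSON schema. Whether and how strongly the model follows these examples is not something the schema guarantees.

```python
import json

from pydantic import BaseModel, Field


class UserDetail(BaseModel):
    name: str = Field(examples=["Timothee Chalamet", "Zendaya"])
    age: int


# The examples are attached to the "name" property of the schema.
schema = UserDetail.model_json_schema()
name_schema = schema["properties"]["name"]
print(json.dumps(name_schema, indent=2))

# The "age" property has no examples because none were declared.
age_schema = schema["properties"]["age"]
```

Because the examples live on a single attribute, they only hint at the shape of `name`; `age` is left entirely to the model.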
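Leveraging Complex Examples
To generate more detailed synthetic examples effectively, let's upgrade to the "gpt-4-turbo-preview" model and use model-level examples rather than attribute-level examples.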
import instructor
from typing import Iterable
from pydantic import BaseModel, ConfigDict
from openai import OpenAI

# Define the UserDetail model
class UserDetail(BaseModel):
    """Old Wizards"""

    name: str
    age: int

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {"name": "Gandalf the Grey", "age": 1000},
                {"name": "Albus Dumbledore", "age": 150},
            ]
        }
    )

# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())

def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate `{count}` synthetic examples"},
        ],
    )

for user in generate_fake_users(5):
    print(user)
    #> name='Merlin' age=1000
    #> name='Saruman the White' age=700
    #> name='Radagast the Brown' age=600
    #> name='Elminster Aumar' age=1200
    #> name='Mordenkainen' age=850
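Model-level examples describe whole records, so name and age are shown together. The sketch below, assuming Pydantic v2, inspects the schema to confirm that both the docstring and the `json_schema_extra` examples appear at the top level, and validates each example against the model as a sanity check. The validation step is an addition for illustration, not something the example above requires.

```python
from pydantic import BaseModel, ConfigDict


class UserDetail(BaseModel):
    """Old Wizards"""

    name: str
    age: int

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {"name": "Gandalf the Grey", "age": 1000},
                {"name": "Albus Dumbledore", "age": 150},
            ]
        }
    )


schema = UserDetail.model_json_schema()

# The class docstring becomes the schema description.
print(schema["description"])

# The examples sit at the model level, not under a single property.
print(schema["examples"])

# Sanity check: every declared example should itself be a valid UserDetail.
wizards = [UserDetail.model_validate(example) for example in schema["examples"]]
```

Validating the examples up front catches typos such as a misspelled field name before they are sent to the model as misleading guidance.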
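Leveraging Descriptions
By adjusting the descriptions in our Pydantic models, we can subtly influence the nature of the synthetic data that is generated. This gives finer control over the output and helps the generated data align more closely with our expectations or requirements.
For instance, in our `UserDetail` model, setting the description of the `name` field to "Fancy French sounding names" guides the generation process toward names that fit this criterion, producing a dataset that is both diverse and tailored to a specific linguistic characteristic.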
import instructor
from typing import Iterable
from pydantic import BaseModel, Field
from openai import OpenAI

# Define the UserDetail model
class UserDetail(BaseModel):
    name: str = Field(description="Fancy French sounding names")
    age: int

# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())

def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate `{count}` synthetic users"},
        ],
    )

for user in generate_fake_users(5):
    print(user)
    #> name='Jean Luc' age=30
    #> name='Claire Belle' age=25
    #> name='Pierre Leclair' age=40
    #> name='Amelie Rousseau' age=35
    #> name='Etienne Lefevre' age=28
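As with examples, the description travels in the schema rather than in the user message. A minimal sketch, assuming Pydantic v2, that reads the description back from the generated schema without calling any API:

```python
from pydantic import BaseModel, Field


class UserDetail(BaseModel):
    name: str = Field(description="Fancy French sounding names")
    age: int


schema = UserDetail.model_json_schema()

# The description is attached to the "name" property only.
name_description = schema["properties"]["name"]["description"]
age_has_description = "description" in schema["properties"]["age"]

print(name_description)
print(age_has_description)
```

Since the guidance is scoped to one field, it can be changed without touching the prompt text, and other fields such as `age` stay unconstrained unless they receive their own description.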