Prioritize Uncertain Examples

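When we have a large pool of unlabeled examples that could be used in a prompt, how do we decide which ones to label by hand?

Active prompting is a method for identifying the examples that benefit most from human annotation. The process has four key steps: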

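1. Uncertainty Estimation: Assess how uncertain the LLM's predictions are on each candidate example.
2. Selection: Choose the most uncertain examples for human annotation.
3. Annotation: Have humans label the selected examples.
4. Inference: Use the newly labeled examples to improve the LLM's performance.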

Uncertainty Estimation

In this step, we define an unsupervised method for measuring how uncertain the LLM is when answering a given example.

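Uncertainty Estimation Example

Suppose we send the LLM the following query:

query = "Classify the sentiment of this sentence as positive or negative: I am very excited today."

The LLM returns:

response = "positive"

The goal of uncertainty estimation is to answer: how sure is the LLM of this response?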

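To find out, we query the LLM with the same example *k* times and then measure how much the *k* responses differ from one another. Three possible metrics are¹:

Disagreement: The ratio of unique responses to total responses.
Entropy: A measure based on the frequency of each distinct response.
Variance: The spread of the responses, applicable when they are numerical.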
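As a rough illustration, the three metrics above can be sketched with the Python standard library alone. The function names, the use of base-2 logarithms for entropy, and the choice of population variance are assumptions made for this sketch; the referenced paper may normalize these quantities differently. The sample answers are hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def disagreement(responses):
    """Ratio of unique responses to total responses."""
    return len(set(responses)) / len(responses)


def entropy(responses):
    """Shannon entropy (in bits) of the empirical response distribution."""
    counts = Counter(responses)
    total = len(responses)
    # sum of p * log2(1 / p), where p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def variance(responses):
    """Population variance; only meaningful for numerical responses."""
    return pvariance(responses)


# Hypothetical answers from k = 5 queries with the same example
heights = [443, 443, 443, 443, 381]

print(disagreement(heights))
print(round(entropy(heights), 3))
print(round(variance(heights), 2))
```

All three values are 0 when every response is identical and grow as the responses diverge, so any of them can serve as the ranking score in the selection step.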

Below is an example of uncertainty estimation for a single input example, using the disagreement metric.

import instructor
from pydantic import BaseModel
from openai import OpenAI


class Response(BaseModel):
    height: int


# Patch the OpenAI client so responses are parsed into the Pydantic model
client = instructor.from_openai(OpenAI())


def query_llm():
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Response,
        messages=[
            {
                "role": "user",
                "content": "How tall is the Empire State Building in meters?",
            }
        ],
    )


def calculate_disagreement(responses):
    # Disagreement = number of unique responses / total number of responses
    unique_responses = set(responses)
    h = len(unique_responses)
    return h / len(responses)  # Use the list length, not a global k


if __name__ == "__main__":
    k = 5  # Number of times to query the LLM with the same example
    responses = [query_llm() for _ in range(k)]  # Query the LLM k times
    for response in responses:
        print(response)
        #> height=443
        #> height=443
        #> height=443
        #> height=443
        #> height=381

    print(
        calculate_disagreement([response.height for response in responses])
    )  # Calculate the uncertainty metric
    #> 0.4

Here, *k* is the number of times the LLM is queried with a single unlabeled example.

This process is then repeated for every unlabeled example.
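A minimal sketch of repeating the estimate over a whole pool of unlabeled examples follows. To keep it runnable offline, the LLM call is replaced by a hypothetical stub (`make_stub`) that replays canned answers; in practice `sample_answer` would wrap a real model call. All names and answers here are illustrative assumptions.

```python
def disagreement(responses):
    """Ratio of unique responses to total responses."""
    return len(set(responses)) / len(responses)


def estimate_uncertainties(examples, sample_answer, k=5):
    """Map each unlabeled example to its disagreement over k sampled answers."""
    uncertainties = {}
    for example in examples:
        answers = [sample_answer(example) for _ in range(k)]
        uncertainties[example] = disagreement(answers)
    return uncertainties


def make_stub(canned):
    """Hypothetical stand-in for an LLM: replays canned answers per question."""
    iterators = {question: iter(answers) for question, answers in canned.items()}
    return lambda question: next(iterators[question])


# Hypothetical sampled answers for two unlabeled questions
canned = {
    "What is 2 + 2?": [4, 4, 4, 4, 4],
    "How tall is the Empire State Building in meters?": [443, 443, 443, 443, 381],
}

scores = estimate_uncertainties(list(canned), make_stub(canned), k=5)
for question, score in scores.items():
    print(f"{score:.1f}  {question}")
```

Note that the number of model calls is *k* times the pool size, so a real run over a large pool may call for batching or a smaller *k*.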

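Selection & Annotation

Once we have a set of examples and their uncertainty scores, we can pick *n* of them for human annotation. Here, we pick the *n* examples with the highest uncertainty.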
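The selection step amounts to a top-*n* ranking by uncertainty. The sketch below assumes the scores are held in a dictionary keyed by example; the function name and the scores are hypothetical.

```python
def select_most_uncertain(uncertainties, n):
    """Return the n examples with the highest uncertainty, most uncertain first."""
    ranked = sorted(uncertainties.items(), key=lambda item: item[1], reverse=True)
    return [example for example, _ in ranked[:n]]


# Hypothetical uncertainty scores for four unlabeled examples
scores = {"q1": 0.2, "q2": 0.8, "q3": 0.4, "q4": 1.0}

to_annotate = select_most_uncertain(scores, n=2)
print(to_annotate)
```

The selected examples are then handed to human annotators. Because Python's sort is stable, ties keep their original pool order.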

Inference

Now, each time we prompt the LLM, we can include the newly annotated examples.
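One way to include the annotated examples is to prepend them as few-shot turns in a chat-style message list. This is a sketch under that assumption; `build_messages` and the example annotations are hypothetical, and the annotated answers could just as well contain a full chain-of-thought rationale.

```python
def build_messages(annotated_examples, query):
    """Build a chat message list with annotated examples as few-shot turns."""
    messages = []
    for question, answer in annotated_examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # The new query always comes last
    messages.append({"role": "user", "content": query})
    return messages


# Hypothetical human-annotated examples chosen in the selection step
annotated = [
    ("How tall is the Empire State Building in meters?", "443"),
    ("How tall is the Eiffel Tower in meters?", "330"),
]

messages = build_messages(annotated, "How tall is Big Ben in meters?")
for message in messages:
    print(message["role"], "|", message["content"])
```

The resulting list can be passed as the `messages` argument of the `client.chat.completions.create` call used earlier.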

References

¹: Active Prompting with Chain-of-Thought for Large Language Models

*: The Prompt Report: A Systematic Survey of Prompting Techniques