
Document Segmentation

In this guide, we demonstrate how to do document segmentation using an LLM's structured outputs. We'll use command-r-plus, one of Cohere's latest LLMs with a 128k context length, and test the approach on an article explaining the Transformer architecture. The same segmentation approach can be applied to any other domain where complex, long documents need to be broken down into smaller chunks.

Motivation

Sometimes we need a way to split a document into meaningful sections, each centered around a single key concept or idea. Simple length- or rule-based text splitters are not reliable enough. Consider a document that contains code snippets or mathematical formulas: we don't want to split those at '\n\n', and we don't want to write a large set of rules for every kind of document. It turns out that an LLM with a sufficiently long context length is well suited for this task.

Defining the Data Structures

First, we need to define a Section class for each segment of the document. A StructuredDocument class then wraps a list of these sections.

Note that to keep the LLM from reproducing the content of every section verbatim, we can simply number each line of the input document and ask the LLM to segment it by returning start and end line numbers for each section.

from pydantic import BaseModel, Field
from typing import List


class Section(BaseModel):
    title: str = Field(description="main topic of this section of the document")
    start_index: int = Field(description="line number where the section begins")
    end_index: int = Field(description="line number where the section ends")


class StructuredDocument(BaseModel):
    """obtains meaningful sections, each centered around a single concept/topic"""

    sections: List[Section] = Field(description="a list of sections of the document")
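
For illustration, here is what a populated StructuredDocument instance might look like; the titles and line ranges below are invented purely as an example, not taken from the article:

example_doc = StructuredDocument(
    sections=[
        Section(title="Introduction", start_index=0, end_index=12),
        Section(title="Self-Attention Basics", start_index=12, end_index=48),
    ]
)
# Pydantic v2's model_dump_json renders the nested sections as JSON
print(example_doc.model_dump_json(indent=2))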

Document Preprocessing

Preprocess the input document by prefixing each line with its line number.

def doc_with_lines(document):
    """Prefix each line with its line number and build a lookup from line number to text."""
    document_lines = document.split("\n")
    document_with_line_numbers = ""
    line2text = {}
    for i, line in enumerate(document_lines):
        document_with_line_numbers += f"[{i}] {line}\n"
        line2text[i] = line
    return document_with_line_numbers, line2text
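
As a quick sanity check, here is roughly what doc_with_lines returns for a tiny made-up snippet (the sample sentences are hypothetical, not part of the article):

sample = "\n".join([
    "Transformers use attention.",
    "Attention has queries, keys, and values.",
    "Multi-head attention uses several heads.",
])
numbered, lookup = doc_with_lines(sample)
print(numbered)
# [0] Transformers use attention.
# [1] Attention has queries, keys, and values.
# [2] Multi-head attention uses several heads.
print(lookup[2])
# Multi-head attention uses several heads.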

Segmentation

Next, use the Cohere client to extract a StructuredDocument from the preprocessed document.

import instructor
import cohere

# Apply the patch to the cohere client
# enables response_model keyword
client = instructor.from_cohere(cohere.Client())


system_prompt = f"""\
You are a world class educator working on organizing your lecture notes.
Read the document below and extract a StructuredDocument object from it where each section of the document is centered around a single concept/topic that can be taught in one lesson.
Each line of the document is marked with its line number in square brackets (e.g. [1], [2], [3], etc). Use the line numbers to indicate section start and end.
"""


def get_structured_document(document_with_line_numbers) -> StructuredDocument:
    return client.chat.completions.create(
        model="command-r-plus",
        response_model=StructuredDocument,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": document_with_line_numbers,
            },
        ],
    )  # type: ignore

Next, we need to recover each section's text from the start/end indices and the line2text dictionary built during preprocessing.

def get_sections_text(structured_doc, line2text):
    segments = []
    for s in structured_doc.sections:
        contents = []
        # end_index is treated as exclusive here: lines start_index..end_index-1 belong to this section
        for line_id in range(s.start_index, s.end_index):
            contents.append(line2text.get(line_id, ''))
        segments.append(
            {
                "title": s.title,
                "content": "\n".join(contents),
                "start": s.start_index,
                "end": s.end_index,
            }
        )
    return segments

Example

Below is an example of using these classes and functions to segment Sebastian Raschka's Transformer tutorial. We can use the trafilatura package to scrape the article's web content.

from trafilatura import fetch_url, extract


url = 'https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html'
downloaded = fetch_url(url)
document = extract(downloaded)


document_with_line_numbers, line2text = doc_with_lines(document)
structured_doc = get_structured_document(document_with_line_numbers)
segments = get_sections_text(structured_doc, line2text)
print(segments[5]['title'])
"""
Introduction to Multi-Head Attention
"""
print(segments[5]['content'])
"""
Multi-Head Attention
In the very first figure, at the top of this article, we saw that transformers use a module called multi-head attention. How does that relate to the self-attention mechanism (scaled-dot product attention) we walked through above?
In the scaled dot-product attention, the input sequence was transformed using three matrices representing the query, key, and value. These three matrices can be considered as a single attention head in the context of multi-head attention. The figure below summarizes this single attention head we covered previously:
As its name implies, multi-head attention involves multiple such heads, each consisting of query, key, and value matrices. This concept is similar to the use of multiple kernels in convolutional neural networks.
To illustrate this in code, suppose we have 3 attention heads, so we now extend the \(d' \times d\) dimensional weight matrices so \(3 \times d' \times d\):
In:
h = 3
multihead_W_query = torch.nn.Parameter(torch.rand(h, d_q, d))
multihead_W_key = torch.nn.Parameter(torch.rand(h, d_k, d))
multihead_W_value = torch.nn.Parameter(torch.rand(h, d_v, d))
Consequently, each query element is now \(3 \times d_q\) dimensional, where \(d_q=24\) (here, let’s keep the focus on the 3rd element corresponding to index position 2):
In:
multihead_query_2 = multihead_W_query.matmul(x_2)
print(multihead_query_2.shape)
Out:
torch.Size([3, 24])
"""