
PII Data Extraction and Scrubbing

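Overview

This example demonstrates how to use OpenAI's ChatCompletion model to extract and scrub personally identifiable information (PII) from a document. The code defines Pydantic models to manage the PII data and provides methods for both extraction and scrubbing.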

Defining the Structures

First, define Pydantic models to represent the PII data and the overall structure used for PII data extraction.

from typing import List
from pydantic import BaseModel


# Define Schemas for PII data
class Data(BaseModel):
    index: int
    data_type: str
    pii_value: str


class PIIDataExtraction(BaseModel):
    """
    Extracted PII data from a document, all data_types should try to have consistent property names
    """

    private_data: List[Data]

    def scrub_data(self, content: str) -> str:
        """
        Iterates over the private data and replaces the value with a placeholder in the form of
        <{data_type}_{i}>
        """
        for i, data in enumerate(self.private_data):
            content = content.replace(data.pii_value, f"<{data.data_type}_{i}>")
        return content
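
To make the replacement logic in scrub_data easier to follow, here is a minimal sketch that mirrors it on plain dictionaries, the shape the entries take after calling model_dump() on the model. The function name scrub, the sample sentence, and the two entries are illustrative assumptions, not part of the original example.

```python
def scrub(content: str, private_data: list[dict]) -> str:
    """Replace each pii_value with <{data_type}_{i}>, where i is the list position."""
    for i, entry in enumerate(private_data):
        content = content.replace(entry["pii_value"], f"<{entry['data_type']}_{i}>")
    return content


# Hypothetical entries shaped like PIIDataExtraction.private_data.
# The "index" field starts at 1 here, but the list position starts at 0.
entries = [
    {"index": 1, "data_type": "Name", "pii_value": "John Doe"},
    {"index": 2, "data_type": "Email", "pii_value": "john.doe@example.com"},
]

text = "Contact John Doe at john.doe@example.com."
scrubbed = scrub(text, entries)
print(scrubbed)
```

Note that the number in each placeholder comes from enumerate, that is, the entry's position in the list, and not from the entry's index field. If the model returns indices that start at 1, the placeholder numbers will not match them.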

Extracting PII Data

Use the OpenAI API to extract PII information from a given document.

from openai import OpenAI
import instructor


client = instructor.from_openai(OpenAI())

EXAMPLE_DOCUMENT = """
# Fake Document with PII for Testing PII Scrubbing Model
# (The content here)
"""

pii_data = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=PIIDataExtraction,
    messages=[
        {
            "role": "system",
            "content": "You are a world class PII scrubbing model, Extract the PII data from the following document",
        },
        {
            "role": "user",
            "content": EXAMPLE_DOCUMENT,
        },
    ],
)  # type: ignore

print("Extracted PII Data:")
#> Extracted PII Data:
print(pii_data.model_dump_json())
"""
{"private_data":[{"index":1,"data_type":"Name","pii_value":"John Doe"},{"index":2,"data_type":"Email","pii_value":"john.doe@example.com"},{"index":3,"data_type":"Phone","pii_value":"+1234567890"},{"index":4,"data_type":"Address","pii_value":"1234 Elm Street, Springfield, IL 62704"},{"index":5,"data_type":"SSN","pii_value":"123-45-6789"}]}
"""

Extracted PII Data Output

{
  "private_data": [
    {
      "index": 0,
      "data_type": "date",
      "pii_value": "01/02/1980"
    },
    {
      "index": 1,
      "data_type": "ssn",
      "pii_value": "123-45-6789"
    },
    {
      "index": 2,
      "data_type": "email",
      "pii_value": "john.doe@email.com"
    },
    {
      "index": 3,
      "data_type": "phone",
      "pii_value": "555-123-4567"
    },
    {
      "index": 4,
      "data_type": "address",
      "pii_value": "123 Main St, Springfield, IL, 62704"
    }
  ]
}
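
Because scrubbing relies on exact string replacement, an extracted value that differs from the document text even slightly (for example, by an extra comma) is silently left in place. The following is a small sketch of a check that could be run before scrubbing. The helper name find_unmatched and the sample document are assumptions made for illustration, and the entries are plain dictionaries as produced by model_dump().

```python
def find_unmatched(content: str, private_data: list[dict]) -> list[dict]:
    """Return the entries whose pii_value does not occur verbatim in content."""
    return [entry for entry in private_data if entry["pii_value"] not in content]


# Hypothetical document text, made up for this sketch.
document = "He lives at 123 Main St, Springfield, IL 62704. SSN: 123-45-6789."
entries = [
    {"index": 0, "data_type": "ssn", "pii_value": "123-45-6789"},
    # Extra comma after "IL": this differs from the document, so replace() would skip it.
    {"index": 1, "data_type": "address", "pii_value": "123 Main St, Springfield, IL, 62704"},
]

unmatched = find_unmatched(document, entries)
for entry in unmatched:
    print(f"not found in document: {entry['data_type']} -> {entry['pii_value']!r}")
```

Any entry reported here would survive scrubbing unchanged, so it may be worth re-running the extraction or reviewing those values by hand.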

Scrubbing PII Data

After extracting the PII data, use the scrub_data method to scrub the document.

print("Scrubbed Document:")
#> Scrubbed Document:
print(pii_data.scrub_data(EXAMPLE_DOCUMENT))
"""
# Fake Document with PII for Testing PII Scrubbing Model
# He was born on <date_0>. His social security number is <ssn_1>. He has been using the email address <email_2> for years, and he can always be reached at <phone_3>.
"""

Scrubbed Document Output

# Fake Document with PII for Testing PII Scrubbing Model

## Personal Story

John Doe was born on <date_0>. His social security number is <ssn_1>. He has been using the email address <email_2> for years, and he can always be reached at <phone_3>.

## Residence

John currently resides at <address_4>. He's been living there for about 5 years now.
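
Replacement is applied one entry at a time, so the result can depend on entry order when one extracted value is a substring of another. The sketch below shows one possible variation that replaces longer values first while keeping the original list position in the placeholder. It is not part of the original example; the function name scrub_longest_first and the sample entries are hypothetical.

```python
def scrub_longest_first(content: str, private_data: list[dict]) -> str:
    """Replace longer pii_values first; placeholders keep the original list position."""
    indexed = list(enumerate(private_data))
    # Sort by value length, longest first, so "John Doe" is handled before "John".
    indexed.sort(key=lambda pair: len(pair[1]["pii_value"]), reverse=True)
    for i, entry in indexed:
        content = content.replace(entry["pii_value"], f"<{entry['data_type']}_{i}>")
    return content


# Hypothetical overlapping entries: the first value is a substring of the second.
entries = [
    {"index": 0, "data_type": "first_name", "pii_value": "John"},
    {"index": 1, "data_type": "name", "pii_value": "John Doe"},
]

text = "John Doe called. John left."
result = scrub_longest_first(text, entries)
print(result)
```

With the entries processed in list order instead, "John" would be replaced first and the full name "John Doe" would no longer be found, leaving "Doe" in the output.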