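PII Data Extraction and Scrubbing
Overview
This example demonstrates how to use OpenAI's ChatCompletion model to extract and scrub personally identifiable information (PII) from a document. The code defines Pydantic models to manage the PII data and provides methods for both extraction and scrubbing.
Defining the Structures
First, define Pydantic models that represent the PII data and the overall structure used for PII data extraction.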
from typing import List

from pydantic import BaseModel


# Define schemas for PII data
class Data(BaseModel):
    index: int
    data_type: str
    pii_value: str


class PIIDataExtraction(BaseModel):
    """
    Extracted PII data from a document, all data_types should try to have consistent property names
    """

    private_data: List[Data]

    def scrub_data(self, content: str) -> str:
        """
        Iterates over the private data and replaces the value with a placeholder in the form of
        <{data_type}_{i}>
        """
        for i, data in enumerate(self.private_data):
            content = content.replace(data.pii_value, f"<{data.data_type}_{i}>")
        return content
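The replacement loop in scrub_data can be tried without calling any API. The sketch below is a minimal illustration, not the library's implementation: `PIIItem` and `scrub` are hypothetical names, and a standard-library dataclass stands in for the Pydantic `Data` model so the snippet runs without dependencies. It also applies one variation that the original method does not: values are replaced longest first, so that a short value such as "John" cannot split a longer one such as "John Doe" before it is replaced.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical stand-in for the Pydantic `Data` model (assumption),
# used so this sketch runs with the standard library only.
@dataclass
class PIIItem:
    data_type: str
    pii_value: str


def scrub(content: str, items: List[PIIItem]) -> str:
    """Replace each value with <{data_type}_{i}>, processing longest values first."""
    # Keep each item's original position i for the placeholder name,
    # but process longer values first to avoid partial replacements.
    ordered = sorted(
        enumerate(items),
        key=lambda pair: len(pair[1].pii_value),
        reverse=True,
    )
    for i, item in ordered:
        content = content.replace(item.pii_value, f"<{item.data_type}_{i}>")
    return content


items = [
    PIIItem("first_name", "John"),
    PIIItem("name", "John Doe"),
    PIIItem("ssn", "123-45-6789"),
]
text = "John Doe (SSN 123-45-6789) signs as John."
print(scrub(text, items))
# <name_1> (SSN <ssn_2>) signs as <first_name_0>.
```

With the list-order replacement used in scrub_data, the same input would replace "John" first and leave " Doe" behind, which is why ordering matters when extracted values overlap.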
Extracting PII Data
Use the OpenAI API to extract PII from a given document.
from openai import OpenAI
import instructor

client = instructor.from_openai(OpenAI())

EXAMPLE_DOCUMENT = """
# Fake Document with PII for Testing PII Scrubbing Model
# (The content here)
"""

pii_data = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=PIIDataExtraction,
    messages=[
        {
            "role": "system",
            "content": "You are a world class PII scrubbing model, Extract the PII data from the following document",
        },
        {
            "role": "user",
            "content": EXAMPLE_DOCUMENT,
        },
    ],
)  # type: ignore

print("Extracted PII Data:")
#> Extracted PII Data:
print(pii_data.model_dump_json())
"""
{"private_data":[{"index":1,"data_type":"Name","pii_value":"John Doe"},{"index":2,"data_type":"Email","pii_value":"john.doe@example.com"},{"index":3,"data_type":"Phone","pii_value":"+1234567890"},{"index":4,"data_type":"Address","pii_value":"1234 Elm Street, Springfield, IL 62704"},{"index":5,"data_type":"SSN","pii_value":"123-45-6789"}]}
"""
Extracted PII Data Output
{
    "private_data": [
        {
            "index": 0,
            "data_type": "date",
            "pii_value": "01/02/1980"
        },
        {
            "index": 1,
            "data_type": "ssn",
            "pii_value": "123-45-6789"
        },
        {
            "index": 2,
            "data_type": "email",
            "pii_value": "john.doe@email.com"
        },
        {
            "index": 3,
            "data_type": "phone",
            "pii_value": "555-123-4567"
        },
        {
            "index": 4,
            "data_type": "address",
            "pii_value": "123 Main St, Springfield, IL, 62704"
        }
    ]
}
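Because scrub_data relies on exact substring replacement, an extracted value that the model reformatted (for example, a phone number written with different punctuation) will not be replaced. A small check along the following lines can flag such values before scrubbing. This is a sketch under stated assumptions: `missing_values` is a hypothetical helper, not part of instructor or Pydantic, and the sample strings are made up for illustration.

```python
from typing import List


def missing_values(document: str, values: List[str]) -> List[str]:
    """Return the extracted values that do not occur verbatim in the document."""
    return [value for value in values if value not in document]


# Illustrative data only: the second value was reformatted and
# no longer matches the document text exactly.
document = "Reach me at john.doe@email.com or 555-123-4567."
extracted = ["john.doe@email.com", "(555) 123-4567"]
print(missing_values(document, extracted))
# ['(555) 123-4567']
```

With the models defined earlier, the list of values could be built as `[d.pii_value for d in pii_data.private_data]` and checked against `EXAMPLE_DOCUMENT`.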
Scrubbing PII Data
After extracting the PII data, use the scrub_data method to scrub the document.
print("Scrubbed Document:")
#> Scrubbed Document:
print(pii_data.scrub_data(EXAMPLE_DOCUMENT))
"""
# Fake Document with PII for Testing PII Scrubbing Model
# He was born on <date_0>. His social security number is <ssn_1>. He has been using the email address <email_2> for years, and he can always be reached at <phone_3>.
"""
Scrubbed Document Output
# Fake Document with PII for Testing PII Scrubbing Model
## Personal Story
John Doe was born on <date_0>. His social security number is <ssn_1>. He has been using the email address <email_2> for years, and he can always be reached at <phone_3>.
## Residence
John currently resides at <address_4>. He's been living there for about 5 years now.
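One way to audit a scrubbed document is to list the placeholders it actually contains and compare them with the extracted items. The sketch below is illustrative only: `list_placeholders` is a hypothetical helper, and the pattern assumes that every data_type consists of letters, digits, and underscores, which matches the placeholders shown above but is not guaranteed for arbitrary model output.

```python
import re
from typing import List

# Matches placeholders of the form <{data_type}_{i}>, e.g. <ssn_1>.
# Assumption: data_type contains only letters, digits, and underscores.
PLACEHOLDER = re.compile(r"<(\w+)_(\d+)>")


def list_placeholders(scrubbed: str) -> List[str]:
    """Return placeholders in order of first appearance, without duplicates."""
    seen: List[str] = []
    for match in PLACEHOLDER.finditer(scrubbed):
        if match.group(0) not in seen:
            seen.append(match.group(0))
    return seen


scrubbed = "Born on <date_0>. SSN: <ssn_1>. Email <email_2>, again <email_2>."
print(list_placeholders(scrubbed))
# ['<date_0>', '<ssn_1>', '<email_2>']
```

If the number of placeholders found is smaller than the number of extracted items, at least one value was never replaced in the text.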