OpenAI's Privacy Filter handles eight generic PII categories — names, emails, phones, addresses, dates, URLs, account numbers, and secrets. Privacy Filter Nigeria, which we just released as a research preview, adds five more — NIN, BVN, passport number, driver's license, and voter's card — and adapts the rest of the taxonomy for Nigerian formats and document conventions.
It's a LoRA adapter on top of openai/privacy-filter, available now on Hugging Face and GitHub.
Why this matters in practice
In a generic detector, two 11-digit numbers look the same. Here is real text the two models handle differently:
My name is Ciroma John. I live at No. 11 Yaba street Ikeja. Review completed with no further action required at this stage; the record was checked against 22443690465 and 34488606925 for consistency with the supporting documentation.
The base Privacy Filter detects both 22443690465 and 34488606925 as a generic account_number:
{
"entities": [
{ "entity_group": "private_person", "word": "Ciroma John", "score": 1.0 },
{ "entity_group": "private_address", "word": "No. 11 Yaba street Ikeja", "score": 1.0 },
{ "entity_group": "account_number", "word": "22443690465", "score": 1.0 },
{ "entity_group": "account_number", "word": "34488606925", "score": 1.0 }
]
}
Privacy Filter Nigeria correctly resolves them as a BVN and a NIN — the labels you actually need:
{
"entities": [
{ "entity_group": "private_person", "word": "Ciroma John", "score": 1.00 },
{ "entity_group": "private_address", "word": "No. 11 Yaba street Ikeja", "score": 0.998 },
{ "entity_group": "private_bvn", "word": "22443690465", "score": 0.991 },
{ "entity_group": "private_nin", "word": "34488606925", "score": 1.00 }
]
}
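Entity output like the above is enough to drive redaction directly. The sketch below is a minimal illustration, not the repository's actual redaction code: it assumes entities carry the surface string under "word" (as in the JSON above) and substitutes each one with a label placeholder. Matching by surface string is only safe when each detected string is unambiguous in the text; offset-based replacement is the robust approach.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected entity span with a [LABEL] placeholder.

    Illustrative only: replaces by surface string, which assumes each
    detected string occurs unambiguously in the text.
    """
    redacted = text
    for ent in entities:
        if ent["score"] < min_score:
            continue
        placeholder = "[" + ent["entity_group"].upper() + "]"
        redacted = redacted.replace(ent["word"], placeholder)
    return redacted

entities = [
    {"entity_group": "private_person", "word": "Ciroma John", "score": 1.0},
    {"entity_group": "private_bvn", "word": "22443690465", "score": 0.991},
    {"entity_group": "private_nin", "word": "34488606925", "score": 1.0},
]
text = "My name is Ciroma John. The record was checked against 22443690465 and 34488606925."
print(redact(text, entities))
```

With the example entities above, this prints the text with [PRIVATE_PERSON], [PRIVATE_BVN], and [PRIVATE_NIN] substituted in place of the detected spans.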
That distinction is not academic. BVN (issued under the Central Bank of Nigeria) and NIN (issued by NIMC) sit under different regulatory regimes and have different handling expectations under the Nigeria Data Protection Act 2023. A pipeline that collapses them into one label loses the information needed to apply those rules correctly — for retention, for access control, for sharing decisions, and for audit.
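One way label-level distinctions translate into pipeline behavior is a per-label handling table. The retention and sharing values below are invented placeholders, not a statement of what the NDPA or CBN require; real values must come from your own compliance review. The point is structural: once BVN and NIN carry distinct labels, they can carry distinct policies.

```python
# Hypothetical handling policies keyed by entity label. The specific
# values here are placeholders for illustration, NOT legal guidance.
HANDLING = {
    "private_bvn": {"retain_days": 0, "shareable": False},
    "private_nin": {"retain_days": 30, "shareable": False},
    "account_number": {"retain_days": 365, "shareable": True},
}

# Most restrictive treatment for any label we have no explicit rule for.
DEFAULT_POLICY = {"retain_days": 0, "shareable": False}

def policy_for(entity_group):
    """Look up the handling policy for a detected entity label."""
    return HANDLING.get(entity_group, DEFAULT_POLICY)
```

A pipeline that only ever sees account_number has a single row in this table and no way to express the difference.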
The broader case
Nigerian fintech, lending, and identity-verification workflows generate a lot of PII-heavy text: KYC notes, compliance reviews, fraud-investigation logs, OCR output from NIMC slips and bank statements, customer-service tickets quoting NUBAN account numbers and phone numbers in local formats. Generic detectors catch the easy cases (names, emails) but tend to either miss or over-collapse the locally important entities — exactly the failure mode shown above.
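The surface shapes of these local identifiers can be sketched with regular expressions, which also shows why regexes alone are not enough: NUBAN account numbers are 10 digits, while NIN and BVN are both 11 digits and therefore indistinguishable by pattern, which is exactly the ambiguity the adapter resolves. The patterns below are rough candidate filters under those assumptions (the phone pattern covers common Nigerian mobile prefixes and is not exhaustive).

```python
import re

# Rough surface patterns for candidate spans. A trained detector is
# still needed for disambiguation: an 11-digit NIN and an 11-digit
# BVN are identical at the regex level.
CANDIDATES = {
    "nuban_like": re.compile(r"\b\d{10}\b"),       # NUBAN accounts: 10 digits
    "nin_or_bvn_like": re.compile(r"\b\d{11}\b"),  # NIN and BVN: both 11 digits
    # Common Nigerian mobile prefixes (070x/080x/081x/090x/091x),
    # local or +234 form; illustrative, not exhaustive.
    "ng_phone_like": re.compile(r"(?:\+234|0)[789][01]\d{8}\b"),
}

def candidate_spans(text):
    """Return every candidate match per pattern name."""
    return {name: pat.findall(text) for name, pat in CANDIDATES.items()}
```

Note that an 11-digit local phone number also matches the 11-digit pattern, another reason a context-aware model has to make the final call.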
The adapter inherits the properties that make the base model practical for production — bidirectional token classification, single-pass span decoding, on-device execution — but with the label vocabulary and training data tuned for entities that matter in Nigerian text.
What's in the release
The repository includes a CLI runner, a FastAPI wrapper, deterministic span post-processing, LoRA fine-tuning and eval-only workflows, model and dataset cards, a small synthetic example dataset for schema inspection, and documentation for publishing and evaluating adapters.
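The deterministic span post-processing step can be illustrated with a simplified sketch. The repository's actual implementation is more involved (subword tokens, character offsets); the (word, label, score) tuple format here is invented for illustration. The idea is to merge contiguous token-level predictions with the same label into one entity span.

```python
def merge_spans(tokens):
    """Merge contiguous same-label token predictions into entity spans.

    `tokens` is a list of (word, label, score), with label None for
    non-entity tokens. A span's score is the minimum of its token
    scores (a conservative choice; averaging is another option).
    """
    spans, current = [], None
    for word, label, score in tokens:
        if label is None:
            if current:
                spans.append(current)
                current = None
            continue
        if current and current["entity_group"] == label:
            current["word"] += " " + word
            current["score"] = min(current["score"], score)
        else:
            if current:
                spans.append(current)
            current = {"entity_group": label, "word": word, "score": score}
    if current:
        spans.append(current)
    return spans
```

Because the merge is a pure function of the token predictions, the same model output always yields the same spans, which is what makes the post-processing auditable.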
Supported labels
Identity documents (new in this adapter):
private_nin, private_bvn, private_passport_number, private_drivers_license_number, private_voters_card_number
Financial:
account_number
Contact and identity:
private_person, private_email, private_phone, private_address
Other:
private_date, private_url, secret
Evaluation
The adapter was evaluated on a private mixed dataset combining synthetic examples, OCR-derived samples from Nigerian identity documents, and real-world domain text. Sensitive fields in the source data were annotated for evaluation and then redacted before any further use; source materials and derived artifacts are not redistributed.
Validation and test results on this dataset are strong, but the adapter is intentionally recall-oriented. The hard-negative challenge split — text containing benign identifier-like numbers such as invoice IDs and internal reference codes — shows the model still over-redacts in some cases. For precision-sensitive use, downstream users should add filters, tune thresholds, or fine-tune further on their own representative data.
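A precision-oriented post-filter of the kind suggested here can be small. The sketch below combines a score threshold with an allowlist of known-benign internal formats; the specific patterns (invoice IDs, reference codes) are hypothetical examples, and real allowlists should be built from your own data.

```python
import re

# Hypothetical allowlist of known-benign identifier formats, e.g.
# invoice IDs like "INV-2024-0013". Build yours from your own data.
BENIGN_PATTERNS = [
    re.compile(r"^INV-\d{4}-\d+$"),
    re.compile(r"^REF\d{6}$"),
]

def precision_filter(entities, min_score=0.9):
    """Drop low-confidence detections and known-benign identifiers."""
    kept = []
    for ent in entities:
        if ent["score"] < min_score:
            continue
        if any(p.match(ent["word"]) for p in BENIGN_PATTERNS):
            continue
        kept.append(ent)
    return kept
```

Raising min_score trades recall for precision; the right operating point depends on whether a missed identifier or a spurious redaction is the costlier error in your pipeline.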
Research preview, not a compliance product
This release adopts the framing of the upstream openai/privacy-filter release: it is a redaction and data-minimization aid, not an anonymization, compliance, or safety guarantee. It should not be used as the only control for legal, medical, financial, regulatory, or otherwise irreversible privacy decisions.
Any production deployment should include representative local evaluation, policy review, access controls, logging discipline, and human review where mistakes could cause harm.
Feedback welcome
We're particularly interested in:
- Nigerian-domain eval examples we can incorporate into public synthetic evals
- Hard-negative cases (benign numbers that look like IDs)
- OCR and document-layout examples, especially noisy NIMC and bank-statement output
- Label-boundary disagreements — places where you'd label something differently than we did
- Additional local formats that should be represented
If you try it, open an issue on the GitHub repository and share what worked and what failed. We read everything.