OpenAI's Privacy Filter handles eight generic PII categories — names, emails, phones, addresses, dates, URLs, account numbers, and secrets. Privacy Filter Nigeria, which we just released as a research preview, adds five more — NIN, BVN, passport number, driver's license, and voter's card — and adapts the rest of the taxonomy for Nigerian formats and document conventions.
It's a LoRA adapter on top of openai/privacy-filter, available now on Hugging Face and GitHub.
Why this matters in practice
In a generic detector, two 11-digit numbers look the same. Here is real text the two models handle differently:
My name is Ciroma John. I live at No. 11 Yaba street Ikeja. Review completed with no further action required at this stage; the record was checked against 22443690465 and 34488606925 for consistency with the supporting documentation.
The base Privacy Filter detects both 22443690465 and 34488606925 as a generic account_number:
{
"entities": [
{ "entity_group": "private_person", "word": "Ciroma John", "score": 1.0 },
{ "entity_group": "private_address", "word": "No. 11 Yaba street Ikeja", "score": 1.0 },
{ "entity_group": "account_number", "word": "22443690465", "score": 1.0 },
{ "entity_group": "account_number", "word": "34488606925", "score": 1.0 }
]
}
Privacy Filter Nigeria correctly resolves them as a BVN and a NIN — the labels you actually need:
{
"entities": [
{ "entity_group": "private_person", "word": "Ciroma John", "score": 1.00 },
{ "entity_group": "private_address", "word": "No. 11 Yaba street Ikeja", "score": 0.998 },
{ "entity_group": "private_bvn", "word": "22443690465", "score": 0.991 },
{ "entity_group": "private_nin", "word": "34488606925", "score": 1.00 }
]
}
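Entity output like the above is enough to drive redaction directly. The sketch below is a minimal illustration, not the repository's actual redaction code: it assumes entities carry the surface string under "word" (as in the JSON above) and substitutes each one with a label placeholder. Matching by surface string is only safe when each detected string is unambiguous in the text; offset-based replacement is the robust approach.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected entity span with a [LABEL] placeholder.

    Illustrative only: replaces by surface string, which assumes each
    detected string occurs unambiguously in the text.
    """
    redacted = text
    for ent in entities:
        if ent["score"] < min_score:
            continue
        placeholder = "[" + ent["entity_group"].upper() + "]"
        redacted = redacted.replace(ent["word"], placeholder)
    return redacted

entities = [
    {"entity_group": "private_person", "word": "Ciroma John", "score": 1.0},
    {"entity_group": "private_bvn", "word": "22443690465", "score": 0.991},
    {"entity_group": "private_nin", "word": "34488606925", "score": 1.0},
]
text = "My name is Ciroma John. The record was checked against 22443690465 and 34488606925."
print(redact(text, entities))
```

With the example entities above, this prints the text with [PRIVATE_PERSON], [PRIVATE_BVN], and [PRIVATE_NIN] substituted in place of the detected spans.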
That distinction is not academic. BVN (issued under the Central Bank of Nigeria) and NIN (issued by NIMC) sit under different regulatory regimes and have different handling expectations under the Nigeria Data Protection Act 2023. A pipeline that collapses them into one label loses the information needed to apply those rules correctly — for retention, for access control, for sharing decisions, and for audit.
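One way label-level distinctions translate into pipeline behavior is a per-label handling table. The retention and sharing values below are invented placeholders, not a statement of what the NDPA or CBN require; real values must come from your own compliance review. The point is structural: once BVN and NIN carry distinct labels, they can carry distinct policies.

```python
# Hypothetical handling policies keyed by entity label. The specific
# values here are placeholders for illustration, NOT legal guidance.
HANDLING = {
    "private_bvn": {"retain_days": 0, "shareable": False},
    "private_nin": {"retain_days": 30, "shareable": False},
    "account_number": {"retain_days": 365, "shareable": True},
}

# Most restrictive treatment for any label we have no explicit rule for.
DEFAULT_POLICY = {"retain_days": 0, "shareable": False}

def policy_for(entity_group):
    """Look up the handling policy for a detected entity label."""
    return HANDLING.get(entity_group, DEFAULT_POLICY)
```

A pipeline that only ever sees account_number has a single row in this table and no way to express the difference.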
The broader case
Nigerian fintech, lending, and identity-verification workflows generate a lot of PII-heavy text: KYC notes, compliance reviews, fraud-investigation logs, OCR output from NIMC slips and bank statements, customer-service tickets quoting NUBAN account numbers and phone numbers in local formats. Generic detectors catch the easy cases (names, emails) but tend to either miss or over-collapse the locally important entities — exactly the failure mode shown above.
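The surface shapes of these local identifiers can be sketched with regular expressions, which also shows why regexes alone are not enough: NUBAN account numbers are 10 digits, while NIN and BVN are both 11 digits and therefore indistinguishable by pattern, which is exactly the ambiguity the adapter resolves. The patterns below are rough candidate filters under those assumptions (the phone pattern covers common Nigerian mobile prefixes and is not exhaustive).

```python
import re

# Rough surface patterns for candidate spans. A trained detector is
# still needed for disambiguation: an 11-digit NIN and an 11-digit
# BVN are identical at the regex level.
CANDIDATES = {
    "nuban_like": re.compile(r"\b\d{10}\b"),       # NUBAN accounts: 10 digits
    "nin_or_bvn_like": re.compile(r"\b\d{11}\b"),  # NIN and BVN: both 11 digits
    # Common Nigerian mobile prefixes (070x/080x/081x/090x/091x),
    # local or +234 form; illustrative, not exhaustive.
    "ng_phone_like": re.compile(r"(?:\+234|0)[789][01]\d{8}\b"),
}

def candidate_spans(text):
    """Return every candidate match per pattern name."""
    return {name: pat.findall(text) for name, pat in CANDIDATES.items()}
```

Note that an 11-digit local phone number also matches the 11-digit pattern, another reason a context-aware model has to make the final call.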
The adapter inherits the properties that make the base model practical for production — bidirectional token classification, single-pass span decoding, on-device execution — but with the label vocabulary and training data tuned for entities that matter in Nigerian text.
What's in the release
The repository includes a CLI runner, a FastAPI wrapper, deterministic span post-processing, LoRA fine-tuning and eval-only workflows, model and dataset cards, a small synthetic example dataset for schema inspection, and documentation for publishing and evaluating adapters.
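The deterministic span post-processing step can be illustrated with a simplified sketch. The repository's actual implementation is more involved (subword tokens, character offsets); the (word, label, score) tuple format here is invented for illustration. The idea is to merge contiguous token-level predictions with the same label into one entity span.

```python
def merge_spans(tokens):
    """Merge contiguous same-label token predictions into entity spans.

    `tokens` is a list of (word, label, score), with label None for
    non-entity tokens. A span's score is the minimum of its token
    scores (a conservative choice; averaging is another option).
    """
    spans, current = [], None
    for word, label, score in tokens:
        if label is None:
            if current:
                spans.append(current)
                current = None
            continue
        if current and current["entity_group"] == label:
            current["word"] += " " + word
            current["score"] = min(current["score"], score)
        else:
            if current:
                spans.append(current)
            current = {"entity_group": label, "word": word, "score": score}
    if current:
        spans.append(current)
    return spans
```

Because the merge is a pure function of the token predictions, the same model output always yields the same spans, which is what makes the post-processing auditable.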
Supported labels
Identity documents (new in this adapter):
private_nin, private_bvn, private_passport_number, private_drivers_license_number, private_voters_card_number
Financial:
account_number
Contact and identity:
private_person, private_email, private_phone, private_address
Other:
private_date, private_url, secret
Evaluation
The adapter was evaluated on a private mixed dataset combining synthetic examples, OCR-derived samples from Nigerian identity documents, and real-world domain text. Sensitive fields in the source data were annotated for evaluation and then redacted before any further use; source materials and derived artifacts are not redistributed.
Validation and test results on this dataset are strong, but the adapter is intentionally recall-oriented. The hard-negative challenge split — text containing benign identifier-like numbers such as invoice IDs and internal reference codes — shows the model still over-redacts in some cases. For precision-sensitive use, downstream users should add filters, tune thresholds, or fine-tune further on their own representative data.
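A precision-oriented post-filter of the kind suggested here can be small. The sketch below combines a score threshold with an allowlist of known-benign internal formats; the specific patterns (invoice IDs, reference codes) are hypothetical examples, and real allowlists should be built from your own data.

```python
import re

# Hypothetical allowlist of known-benign identifier formats, e.g.
# invoice IDs like "INV-2024-0013". Build yours from your own data.
BENIGN_PATTERNS = [
    re.compile(r"^INV-\d{4}-\d+$"),
    re.compile(r"^REF\d{6}$"),
]

def precision_filter(entities, min_score=0.9):
    """Drop low-confidence detections and known-benign identifiers."""
    kept = []
    for ent in entities:
        if ent["score"] < min_score:
            continue
        if any(p.match(ent["word"]) for p in BENIGN_PATTERNS):
            continue
        kept.append(ent)
    return kept
```

Raising min_score trades recall for precision; the right operating point depends on whether a missed identifier or a spurious redaction is the costlier error in your pipeline.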
Research preview, not a compliance product
This release adopts the framing of the upstream openai/privacy-filter release: it is a redaction and data-minimization aid, not an anonymization, compliance, or safety guarantee. It should not be used as the only control for legal, medical, financial, regulatory, or otherwise irreversible privacy decisions.
Any production deployment should include representative local evaluation, policy review, access controls, logging discipline, and human review where mistakes could cause harm.
Feedback welcome
We're particularly interested in:
- Nigerian-domain eval examples we can incorporate into public synthetic evals
- Hard-negative cases (benign numbers that look like IDs)
- OCR and document-layout examples, especially noisy NIMC and bank-statement output
- Label-boundary disagreements — places where you'd label something differently than we did
- Additional local formats that should be represented
If you try it, open an issue on the GitHub repository and share what worked and what failed. We read everything.