Teaching privacy models to understand Nigerian data.
A LoRA adapter on OpenAI's Privacy Filter base model that adds five Nigerian identity-document labels — NIN, BVN, passport, driver's license, and voter's card — on top of the eight generic categories the base model already covers, and adapts the rest of the taxonomy for Nigerian formats.
Why a Nigerian adapter?
OpenAI's Privacy Filter performs context-aware detection on unstructured text — reading the surrounding language and entity co-occurrence to type each span — so two numbers with identical formats can resolve to different labels depending on how each is referenced. Heuristic detectors can't make that call.
Banks, fintechs, lending, and identity-verification workflows generate PII-heavy text — KYC notes, compliance reviews, fraud-investigation logs, OCR output from NIMC slips and bank statements, customer-service tickets quoting NUBAN account numbers and phone numbers in local formats.
A generic model catches the easy cases but tends to miss orover-collapse the locally important entities — losing information you need for retention, access control, sharing decisions, and audit.
“My name is Ciroma John. I live at No. 11 Yaba street Ikeja. Review completed with no further action required at this stage; the record was checked against 22443690465 and 34488606925 for consistency with the supporting documentation.”
{
"entities": [
{ "entity_group": "private_person", "word": "Ciroma John", "score": 1.0 },
{ "entity_group": "private_address", "word": "No. 11 Yaba street Ikeja", "score": 1.0 },
{ "entity_group": "account_number", "word": "22443690465", "score": 1.0 },
{ "entity_group": "account_number", "word": "34488606925", "score": 1.0 }
]
}{
"entities": [
{ "entity_group": "private_person", "word": "Ciroma John", "score": 1.00 },
{ "entity_group": "private_address", "word": "No. 11 Yaba street Ikeja", "score": 0.998 },
{ "entity_group": "private_bvn", "word": "22443690465", "score": 0.991 },
{ "entity_group": "private_nin", "word": "34488606925", "score": 1.00 }
]
}The base model labels both 11-digit numbers as a generic account_number. The adapter resolves them as a BVN and a NIN — the labels you actually need.
Under the Nigeria Data Protection Act 2023, they sit in different regulatory regimes with different handling expectations for retention, access control, sharing decisions, and audit. A pipeline that collapses them into one label loses the information needed to apply those rules correctly.
What it detects
| Entity Label | Description | Example Match |
|---|---|---|
| private_person | Person names and name-like references | Amina Yusuf |
| private_phone | Nigerian local and international phone formats | +234 802 111 3344 |
| private_email | Email addresses | amina.yusuf@example.ng |
| private_address | Street, city, state, or postal addresses | 42 Unity Road, Ikeja, Lagos 100271 |
| private_date | Dates tied to a person, record, or event | 12 April 1988 |
| private_bvn | Bank Verification Number references and values | 22334455667 |
| private_nin | National Identification Number references and values | 12345678901 |
| account_number | NUBAN-style Nigerian bank account numbers | 6318826391 |
| private_passport_number | Nigerian passport identifiers | B05995318 |
| private_drivers_license_number | Nigerian driver license identifiers | K2BHY7F6FEA0 |
| private_voters_card_number | Nigerian voter card identifiers | ABCD 1234 5678 9012 345 |
| private_url | URLs tied to private records or workflows | https://claims.example/record/1234 |
| secret | Credentials, authorization codes, session tokens | S3cure!9037Ops |
Performance
The adapter was evaluated on a private mixed dataset combining synthetic examples, OCR-derived samples from Nigerian identity documents, and real-world domain text. Sensitive fields in the source data were annotated for evaluation and then redacted before any further use; source materials and derived artifacts are not redistributed.
Recall-oriented (v0.1 research preview)
Validation and test results on this dataset are strong, but the adapter is intentionally recall-oriented. The hard-negative challenge split — text containing benign identifier-like numbers such as invoice IDs and internal reference codes — shows the model still over-redacts in some cases. For precision-sensitive use, downstream users should add filters, tune thresholds, or finetune further on their own representative data.
Challenge split is hard-negative-only — 0.72 false-positive example rate across 250 examples. Typed F1 is not a meaningful metric on this split.
How to use
CLI Usage
Sync deps and run the published Naija LoRA adapter on top of OpenAI's Privacy Filter base model.
uv sync
uv run python main.py \
--model-name openai/privacy-filter \
--adapter-name iamSamurai/privacy-filter-nigeria \
"My name is Harry Potter and my email is harry.potter@hogwarts.edu."REST API
Serve predictions over HTTP. The base model and adapter are env-driven so you can swap checkpoints without code changes.
PRIVACY_FILTER_MODEL_NAME=openai/privacy-filter \
PRIVACY_FILTER_ADAPTER_NAME=iamSamurai/privacy-filter-nigeria \
uv run uvicorn api:app --reloadcurl -X POST http://127.0.0.1:8000/predict \
-H "Content-Type: application/json" \
-d '{"text":"Amina Yusuf can be reached at +234 802 111 3344.","mode":"cleaned"}'Future Work
More local formats support
Include additional local formats that should be represented.
Per-label evaluation
Track per-label precision, recall, F1, boundary accuracy, and false-positive rates on numeric IDs and generic locations as separate signals — not a single F1 number.
Expanded OOD eval
Extend the private mixed eval with broader OOD coverage — mixed naming conventions, non-Lagos addresses, code-switched text, and multi-ID records — and publish a reproducible OOD slice.
Hybrid postprocessing
Pair the model with regex and recognizer postprocessing for deterministic entity types (emails, URLs, phone numbers, structured IDs) rather than relying on the model alone.