Dagger — Detection Engine
Dagger is Sonomos’s PII detection engine. It combines deterministic pattern matching with on-device AI to identify sensitive data across 62+ categories, and it runs entirely locally in your browser.
How detection works
When text appears on a page — whether typed, pasted, or loaded — Dagger processes it through multiple detection layers in sequence:
- Tokenization & normalization — text is split, normalized for whitespace and unicode, and prepared for matching.
- Pattern analysis — deterministic matchers run first because they’re fast and produce few false positives.
- AI-powered recognition — an on-device language model scans for unstructured PII the patterns can’t catch.
- Validation & deduplication — overlapping matches are merged, validators confirm structured matches (e.g. Luhn for credit cards), and very-short / common-word matches are filtered.
- Severity scoring — each surviving match is assigned a severity tier that drives the risk widget color and Cloak’s behavior.
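The layers above can be sketched roughly as follows. This is a minimal illustration and not Dagger's actual code: the `Match` type, the function names, the single email matcher, and the rule for choosing a category when spans overlap are all assumptions made for the example, and the model layer is stubbed out as a pluggable callable.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    start: int      # inclusive character offset into the normalized text
    end: int        # exclusive character offset
    category: str   # hypothetical category label, e.g. "email"


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


# One illustrative deterministic matcher; a real engine would have many.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pattern_matches(text: str) -> list[Match]:
    return [Match(m.start(), m.end(), "email") for m in EMAIL_RE.finditer(text)]


def merge_overlaps(matches: list[Match]) -> list[Match]:
    """Merge overlapping spans; keep the category of the longer span (an assumption)."""
    merged: list[Match] = []
    for m in sorted(matches, key=lambda x: (x.start, -x.end)):
        if merged and m.start < merged[-1].end:
            prev = merged[-1]
            longer = prev if (prev.end - prev.start) >= (m.end - m.start) else m
            merged[-1] = Match(prev.start, max(prev.end, m.end), longer.category)
        else:
            merged.append(m)
    return merged


def detect(text: str, model_matches=lambda clean: []) -> list[Match]:
    """Run normalization, pattern matching, the (stubbed) model, then merging."""
    clean = normalize(text)
    found = pattern_matches(clean) + list(model_matches(clean))
    return merge_overlaps(found)
```

Running the pattern layer before the model mirrors the ordering described above: cheap, precise matchers go first, and the merge step reconciles whatever both layers report for the same span.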
Pattern analysis
Pattern analysis uses deterministic matchers built for structured PII formats. These matchers are fast and precise, and they include built-in validation (such as checksum verification for financial identifiers). Examples include:
- Social Security numbers (with format validation)
- Credit card numbers (with Luhn checksum)
- Email addresses
- Phone numbers (US and international)
- IP addresses (IPv4 and IPv6)
- Dates of birth
- Medical record numbers
- Driver’s license numbers (state-specific patterns)
- Passport numbers
- IBANs and SWIFT/BIC codes
Pattern matchers run synchronously and effectively instantaneously, so they don’t add perceptible latency.
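As an illustration of a validator-backed matcher, the sketch below pairs a candidate regex with the Luhn checksum mentioned above for credit card numbers. The regex, the 13 to 19 digit length range, and the function names are assumptions chosen for the example, not Dagger's actual patterns.

```python
import re

# Candidate: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex candidates that also pass the checksum."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

The regex alone would accept any long run of digits, such as an order number. Requiring the checksum to pass is what keeps the false-positive rate low for this kind of detector.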
AI-powered recognition
On-device small language models identify unstructured PII that pattern analysis alone can’t catch:
- Person names in free text (with disambiguation against common words)
- Organization names
- Location references (cities, regions, facilities)
- Context-dependent identifiers (e.g. “the patient”, “Mr. Smith’s account”)
All AI models are bundled with the extension and run locally on your device. No data is sent to external servers for analysis, and no remote inference call is ever made.
Detection categories
Dagger currently includes 62 detectors with comprehensive test coverage. Each detector assigns its matches to one of three severity tiers:
| Severity | Color | Examples | Default Cloak behavior |
|---|---|---|---|
| High | 🔴 Red | SSN, credit card, passport, medical record number, DEA number | Auto-mask |
| Medium | 🟡 Amber | Full name, date of birth, address, phone number | Auto-mask on supported sites |
| Low | 🟢 Green | Email, public IP address, organization name | Mask when Cloak strict mode is enabled |
Severity drives the risk widget color and determines whether Cloak auto-masks or simply flags. See the full list in the Detector Reference.
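The severity table can be read as a small decision function. The sketch below is an assumption-laden illustration: the category keys, the `supported_site` and `strict_mode` flags, and the `"mask"` / `"flag"` return values are hypothetical names, and only a subset of the table's examples is included.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Subset of the table's examples; the category keys are hypothetical.
SEVERITY_BY_CATEGORY = {
    "ssn": Severity.HIGH,
    "credit_card": Severity.HIGH,
    "passport": Severity.HIGH,
    "full_name": Severity.MEDIUM,
    "date_of_birth": Severity.MEDIUM,
    "phone_number": Severity.MEDIUM,
    "email": Severity.LOW,
    "organization_name": Severity.LOW,
}


def cloak_action(category: str, *, supported_site: bool, strict_mode: bool) -> str:
    """Return "mask" or "flag" for a match, following the default behaviors in the table."""
    severity = SEVERITY_BY_CATEGORY[category]
    if severity is Severity.HIGH:
        return "mask"                                   # always auto-masked
    if severity is Severity.MEDIUM:
        return "mask" if supported_site else "flag"     # auto-mask on supported sites
    return "mask" if strict_mode else "flag"            # low: mask in strict mode
```

For example, `cloak_action("email", supported_site=True, strict_mode=False)` returns `"flag"`, while the same call with `strict_mode=True` returns `"mask"`.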
Image and document detection
For PII embedded in images displayed on a page (for example, an inline screenshot or a scanned document), Dagger uses optical character recognition (OCR) to extract the text, then runs it through the same detection pipeline. PDF content is also parsed and analyzed automatically.
| Source | Handling |
|---|---|
| Inline images | OCR → detection pipeline |
| Inline PDFs | Text extraction → detection pipeline |
| Drag-and-drop file | Scanned on the page where the drop target receives it |
| Screenshots | Not supported in the extension. Coming in Sonomos Desktop. |
Tuning and false positives
A few things to know about how Dagger behaves in practice:
- Short or common words (three characters or fewer, or words on a stop-list) that the model flags as names are excluded from Cloak masking automatically. They still appear in the detection panel so you can audit them, but because they are not masked, they will not break your prompt.
- Validator-backed detectors (SSNs, credit cards, IBANs, NPI numbers) reject matches that fail the appropriate checksum, dramatically reducing noise.
- Per-page rescans are debounced — if you’re typing quickly, Dagger waits for a brief pause before re-evaluating to keep CPU usage low.
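The short-word exclusion described in the first bullet can be sketched as a filter that separates what the detection panel shows from what Cloak masks. The `Detection` type, the stop-list contents, and the `"person_name"` category label are hypothetical, and the sketch assumes a word is excluded if it is either short or on the stop-list.

```python
from dataclasses import dataclass

# Hypothetical stop-list of common words that are also given names.
STOP_LIST = frozenset({"will", "mark", "rose", "grace"})
MAX_SHORT_LENGTH = 3  # "3 characters or fewer"


@dataclass(frozen=True)
class Detection:
    text: str
    category: str


def is_maskable(d: Detection) -> bool:
    """Name matches that are very short or on the stop-list are not masked."""
    if d.category != "person_name":
        return True
    word = d.text.strip().lower()
    return len(word) > MAX_SHORT_LENGTH and word not in STOP_LIST


def partition(detections: list[Detection]) -> tuple[list[Detection], list[Detection]]:
    """Return (panel, to_mask): the panel keeps everything for auditing."""
    panel = list(detections)
    to_mask = [d for d in panel if is_maskable(d)]
    return panel, to_mask
```

Keeping the full list for the panel while masking only a subset matches the behavior described above: nothing is hidden from the audit view, but low-value name matches do not rewrite the prompt.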
Known limitations
The prototype has known sharp edges we’re tracking:
- ZIP code edge cases — certain street-number-plus-ZIP formats can be misclassified. An improved validator is in development.
- Short-token false positives — very short common words may occasionally be flagged as names. They’re excluded from Cloak masking automatically.
- Non-Latin scripts — detection quality is strongest in English. Multilingual coverage is improving and will expand significantly in Sonomos Desktop, which can run larger multilingual models.
- Browser-only surface — Dagger only sees what’s on the web page. Native AI clients, IDEs, and the system clipboard are out of scope until Sonomos Desktop ships.
Reporting detection issues
Reports of false positives and false negatives are extremely useful: they’re the fastest way to improve the engine. Open the Sonomos popup, click Send Feedback, and include a description of what was flagged or missed. Don’t paste real PII; describe the format instead. Reports go directly to our team at info@sonomos.ai.