Dagger — Detection Engine

Dagger is Sonomos’s PII detection engine. It combines deterministic pattern matching with on-device AI to identify sensitive data across 62+ categories, all running locally in your browser.

How detection works

When text appears on a page — whether typed, pasted, or loaded — Dagger processes it through multiple detection layers in sequence:

Tokenization & normalization — text is split, normalized for whitespace and unicode, and prepared for matching.
Pattern analysis — deterministic matchers run first because they’re fast and produce few false positives.
AI-powered recognition — an on-device language model scans for unstructured PII the patterns can’t catch.
Validation & deduplication — overlapping matches are merged, validators confirm structured matches (e.g. Luhn for credit cards), and very-short / common-word matches are filtered.
Severity scoring — each surviving match is assigned a severity tier that drives the risk widget color and Cloak’s behavior.
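
The layered flow above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not Dagger's implementation: `Candidate`, `ScoredMatch`, `normalize`, `findEmails`, `mergeOverlaps`, `score`, and `detect` are hypothetical names, the email pattern is deliberately simplistic, and the AI layer is left as a comment. The overlap rule (keep the longer span) and the fallback tier for unknown categories are also assumptions made for the sketch.

```typescript
// Illustrative sketch only: every name and shape here is hypothetical.
type Severity = "high" | "medium" | "low";

interface Candidate {
  category: string;
  start: number; // inclusive offset into the normalized text
  end: number; // exclusive offset
}

interface ScoredMatch extends Candidate {
  severity: Severity;
}

// Normalization: fold unicode compatibility forms and collapse whitespace runs.
function normalize(text: string): string {
  return text.normalize("NFKC").replace(/\s+/g, " ").trim();
}

// Pattern analysis: one deliberately simple deterministic matcher.
function findEmails(text: string): Candidate[] {
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const found: Candidate[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    found.push({ category: "email", start: m.index, end: m.index + m[0].length });
  }
  return found;
}

// Deduplication: when two candidates overlap, keep the longer span (assumed rule).
function mergeOverlaps(candidates: Candidate[]): Candidate[] {
  const sorted = [...candidates].sort((a, b) => a.start - b.start || b.end - a.end);
  const kept: Candidate[] = [];
  for (const c of sorted) {
    const last = kept[kept.length - 1];
    if (last !== undefined && c.start < last.end) {
      if (c.end - c.start > last.end - last.start) kept[kept.length - 1] = c;
    } else {
      kept.push(c);
    }
  }
  return kept;
}

// Severity scoring: tiers for these three categories follow the table in this page.
const SEVERITY_BY_CATEGORY: Record<string, Severity> = {
  ssn: "high",
  phone: "medium",
  email: "low",
};

function score(c: Candidate): ScoredMatch {
  // Falling back to "low" for unknown categories is a placeholder choice.
  return { ...c, severity: SEVERITY_BY_CATEGORY[c.category] ?? "low" };
}

function detect(text: string): ScoredMatch[] {
  const normalized = normalize(text);
  const candidates = findEmails(normalized);
  // An on-device model would contribute further candidates here (omitted).
  return mergeOverlaps(candidates).map(score);
}
```

Calling `detect("  Contact:   alice@example.com  ")` in this sketch returns a single `email` match whose offsets refer to the normalized string, which is why normalization has to happen before any matcher runs.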

Pattern analysis

Deterministic matchers built for structured PII formats. These are fast, precise, and include built-in validation (such as checksum verification for financial identifiers). Examples include:

Social Security numbers (with format validation)
Credit card numbers (with Luhn checksum)
Email addresses
Phone numbers (US and international)
IP addresses (IPv4 and IPv6)
Dates of birth
Medical record numbers
Driver’s license numbers (state-specific patterns)
Passport numbers
IBANs and SWIFT/BIC codes

Pattern matchers run synchronously and effectively instantaneously, so they don’t add perceptible latency.
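
As an illustration of the built-in validation mentioned above, here is a minimal sketch of a Luhn checksum check for credit card candidates. The function name `passesLuhn` and the accepted length range of 13 to 19 digits are assumptions for the example, not details taken from Dagger.

```typescript
// Hypothetical helper: returns true when the candidate's digits satisfy the Luhn checksum.
function passesLuhn(candidate: string): boolean {
  // Drop the spaces and hyphens that commonly separate digit groups.
  const digits = candidate.replace(/[\s-]/g, "");
  // Assumed length bounds for card numbers; reject anything else up front.
  if (!/^\d{13,19}$/.test(digits)) return false;

  let sum = 0;
  let doubleNext = false; // the rightmost digit is not doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9; // same as adding the two digits of the product
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}
```

A regex match that looks like a card number but fails this check (for example a 16-digit order ID) can be discarded before it reaches the user, which is how a checksum keeps a pattern matcher precise.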

AI-powered recognition

On-device small language models identify unstructured PII that pattern analysis alone can’t catch:

Person names in free text (with disambiguation against common words)
Organization names
Location references (cities, regions, facilities)
Context-dependent identifiers (e.g. “the patient”, “Mr. Smith’s account”)

All AI models run locally on your device and are bundled with the extension. No data is sent to external servers for analysis, and no remote inference call is ever made.

Detection categories

Dagger currently includes 62 detectors with comprehensive test coverage. Each detector classifies matches into severity tiers:

Severity	Color	Examples	Default Cloak behavior
High	🔴 Red	SSN, credit card, passport, medical record number, DEA number	Auto-mask
Medium	🟡 Amber	Full name, date of birth, address, phone number	Auto-mask on supported sites
Low	🟢 Green	Email, public IP address, organization name	Mask if Cloak strict mode is on

Severity drives the risk widget color and determines whether Cloak auto-masks or simply flags. See the full list in the Detector Reference.
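
The default behaviors in the table can be read as a small decision function. The sketch below is an assumption-laden illustration: `CloakContext`, its `siteSupported` and `strictMode` fields, and `defaultCloakAction` are hypothetical names, and it assumes that anything not auto-masked is flagged instead.

```typescript
// Hypothetical types: not Cloak's actual configuration surface.
type Severity = "high" | "medium" | "low";
type Action = "mask" | "flag";

interface CloakContext {
  siteSupported: boolean; // assumed: whether the current site is a supported one
  strictMode: boolean; // assumed: whether Cloak strict mode is switched on
}

// Mirrors the "Default Cloak behavior" column of the severity table.
function defaultCloakAction(severity: Severity, ctx: CloakContext): Action {
  switch (severity) {
    case "high":
      return "mask"; // always auto-masked
    case "medium":
      return ctx.siteSupported ? "mask" : "flag";
    case "low":
      return ctx.strictMode ? "mask" : "flag";
  }
}
```

For example, under these assumptions an email address (low severity) on a supported site with strict mode off would be flagged rather than masked.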

Image and document detection

For PII embedded in images on the page (such as an embedded screenshot or scanned document), Dagger uses optical character recognition (OCR) to extract text before running it through the same detection pipeline. PDF content is also parsed and analyzed automatically.

Source	Handling
Inline images	OCR → detection pipeline
Inline PDFs	Text extraction → detection pipeline
Drag-and-drop file	Scanned on the page where the drop target receives it
Screenshots	Not supported in the extension. Coming in Sonomos Desktop.

Tuning and false positives

A few things to know about how Dagger behaves in practice:

Short common words (three characters or fewer, or listed on a stop-list) that the model flags as names are excluded from Cloak masking automatically. They still appear in the detection panel so you can audit them, but they are left unmasked so your prompt stays readable.
Validator-backed detectors (SSNs, credit cards, IBANs, NPI numbers) reject matches that fail the appropriate checksum, dramatically reducing noise.
Per-page rescans are debounced — if you’re typing quickly, Dagger waits for a brief pause before re-evaluating to keep CPU usage low.
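
The debouncing described in the last item can be sketched as a generic helper. This is one common way to write a debounce, not necessarily how Dagger does it; the names `debounce`, `cancel`, and `flush` are assumptions for the example.

```typescript
// Hypothetical helper: delays fn until waitMs have passed without a new call.
interface Debounced<A extends unknown[]> {
  (...args: A): void;
  cancel(): void; // drop any pending call
  flush(): void; // run any pending call immediately
}

function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number,
): Debounced<A> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let pendingArgs: A | undefined;

  const run = () => {
    timer = undefined;
    if (pendingArgs !== undefined) {
      const args = pendingArgs;
      pendingArgs = undefined;
      fn(...args);
    }
  };

  const debounced = ((...args: A) => {
    // Each new call replaces the pending arguments and restarts the wait.
    pendingArgs = args;
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(run, waitMs);
  }) as Debounced<A>;

  debounced.cancel = () => {
    if (timer !== undefined) clearTimeout(timer);
    timer = undefined;
    pendingArgs = undefined;
  };

  debounced.flush = () => {
    if (timer !== undefined) clearTimeout(timer);
    run();
  };

  return debounced;
}
```

A caller might wrap a scan function as `debounce(scanPage, 150)` and invoke the result on every input event; `scanPage` and the 150 ms wait are placeholders, since the text only specifies "a brief pause". Only the last call in a burst of keystrokes triggers a scan, which is what keeps CPU usage low while typing.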

Known limitations

The prototype has known sharp edges we’re tracking:

ZIP code edge cases — certain street-number-plus-ZIP formats can be misclassified. An improved validator is in development.
Short-token false positives — very short common words may occasionally be flagged as names. They’re excluded from Cloak masking automatically.
Non-Latin scripts — detection quality is strongest in English. Multilingual coverage is improving and will expand significantly in Sonomos Desktop, which can run larger multilingual models.
Browser-only surface — Dagger only sees what’s on the web page. Native AI clients, IDEs, and the system clipboard are out of scope until Sonomos Desktop ships.

Reporting detection issues

Reports of false positives and false negatives are extremely useful: they’re the fastest way to improve the engine. Open the Sonomos popup, click Send Feedback, and describe the issue (don’t paste real PII; describe its format instead). Reports go directly to our team at info@sonomos.ai.