Case 04 — HTML Text Cloaking and Homoglyph Evasion
Sources: Microsoft Defender for Office 365 (2024), Cofense Intelligence 2024 to 2025, Sophos X-Ops Q1 2025, Trustwave SpiderLabs.
Background
Many email gateways rely on keyword and regex matching to identify phishing. Attackers respond by visually preserving the keyword while textually fragmenting it. The browser still renders normal phishing language, but substring or regex detection sees a scrambled byte stream that never matches.
The six bypass techniques below cover almost everything observed in the wild from 2024 through 2025. Vigilyx defeats every single one of them — and that is the point of this case study.
The six bypass techniques
1. Zero-width character insertion
Insert invisible Unicode between every letter:
| Code point | Name |
|---|---|
| U+200B | ZERO WIDTH SPACE |
| U+200C | ZERO WIDTH NON-JOINER |
| U+200D | ZERO WIDTH JOINER |
| U+FEFF | ZERO WIDTH NO-BREAK SPACE |
| U+00AD | SOFT HYPHEN |
| U+2060 | WORD JOINER |
The string still renders as verify your account to the human eye, but the byte sequence is v\u200Be\u200Br\u200Bi\u200Bf\u200By.... A naive /verify your account/ regex never matches.
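A minimal Python sketch of why the naive regex misses and how stripping the invisible code points restores the match. It uses only the six characters listed above; `strip_invisible` is a hypothetical helper written for illustration, not Vigilyx's `normalize_text()`.

```python
import re

# The six invisible code points listed above (illustrative, not exhaustive).
INVISIBLE = "\u200b\u200c\u200d\ufeff\u00ad\u2060"
_STRIP_TABLE = {ord(ch): None for ch in INVISIBLE}

KEYWORD = re.compile(r"verify your account")


def strip_invisible(text: str) -> str:
    """Remove the invisible code points so keyword matching sees contiguous letters."""
    return text.translate(_STRIP_TABLE)


# Build a cloaked string by inserting U+200B between every character.
cloaked = "\u200b".join("verify your account")

print(KEYWORD.search(cloaked) is not None)                   # False
print(KEYWORD.search(strip_invisible(cloaked)) is not None)  # True
```

Stripping before matching keeps the keyword list itself simple; the alternative of allowing optional invisible characters inside every pattern grows quickly and is easy to get wrong.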
2. HTML entity encoding
<!-- Renders as "verify your account" but the literal text contains none of those letters -->
&#118;&#101;&#114;&#105;&#102;&#121; &#121;&#111;&#117;&#114; &#97;&#99;&#99;&#111;&#117;&#110;&#116;

Variants: hexadecimal (&#x76;&#x65;...), named entities, mixed encoding.
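The effect of entity encoding, and of decoding it before matching, can be sketched with Python's standard `html.unescape`. The sample strings and the `decode_entities` helper are assumptions made for illustration; the whitespace collapse is included because `&nbsp;` decodes to U+00A0 rather than an ordinary space.

```python
import html
import re

KEYWORD = re.compile(r"verify your account")

samples = [
    "&#118;&#101;&#114;&#105;&#102;&#121; your account",        # decimal entities
    "&#x76;&#x65;&#x72;&#x69;&#x66;&#x79; your account",        # hexadecimal entities
    "&#118;&#x65;rify your&nbsp;account",                       # mixed, with a named entity
]


def decode_entities(text: str) -> str:
    """Decode HTML entities, then collapse all whitespace (including U+00A0) to one space."""
    return re.sub(r"\s+", " ", html.unescape(text))


for raw in samples:
    before = KEYWORD.search(raw) is not None
    after = KEYWORD.search(decode_entities(raw)) is not None
    print(before, after)  # False True for each sample
```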
3. CSS-hidden text (display:none, visibility:hidden, font-size:0)
<p>Dear cust<span style="display:none">RANDOMJUNK</span>omer,
please ver<span style="font-size:0">XYZ</span>ify your acc<span
style="visibility:hidden">QQQ</span>ount immediately.</p>

The reader sees the full phishing prose. After strip_tags, the extracted text reads custRANDOMJUNKomer, verXYZify, and accQQQount, so keyword scanning fails.
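One way to recover what the reader actually sees is to skip text inside elements hidden by inline CSS, and to count those elements as a signal. The sketch below uses Python's standard `html.parser`; the class name, the regex, and the void-tag list are assumptions for illustration, it only inspects inline `style` attributes, and it assumes well-nested markup.

```python
import re
from html.parser import HTMLParser

# Inline styles that hide an element from the reader (illustrative, not exhaustive).
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0(?![.\d])",
    re.IGNORECASE,
)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "col", "wbr"}


class VisibleText(HTMLParser):
    """Collect text outside hidden elements and count the hidden elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.hidden_elements = 0
        self._stack = []  # one bool per open element: is it, or an ancestor, hidden?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        hides = bool(HIDDEN_STYLE.search(style))
        if hides:
            self.hidden_elements += 1
        inherited = bool(self._stack) and self._stack[-1]
        self._stack.append(hides or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not (self._stack and self._stack[-1]):
            self.chunks.append(data)


def visible_text(markup: str):
    parser = VisibleText()
    parser.feed(markup)
    parser.close()
    text = re.sub(r"\s+", " ", "".join(parser.chunks)).strip()
    return text, parser.hidden_elements


sample = (
    '<p>Dear cust<span style="display:none">RANDOMJUNK</span>omer,\n'
    'please ver<span style="font-size:0">XYZ</span>ify your acc<span\n'
    'style="visibility:hidden">QQQ</span>ount immediately.</p>'
)
text, hidden = visible_text(sample)
print(text)    # Dear customer, please verify your account immediately.
print(hidden)  # 3
```

Returning the hidden-element count alongside the text lets a caller treat heavy use of hidden spans as evidence in its own right, even when no keyword matches.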
4. Table or cell fragmentation
<table><tr>
<td>ver</td><td>ify</td><td> </td>
<td>your</td><td> </td><td>acc</td><td>ount</td>
</tr></table>

Rendered, this is one continuous line, but an extractor that puts a separator between cells yields ver|ify| |your| |acc|ount.
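The difference comes down to how text chunks are joined after tag removal. A small Python sketch, using the standard `html.parser` and a hypothetical `extract` helper, contrasts a separator-joined extraction with one that concatenates chunks directly and then collapses whitespace:

```python
import re
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect every text node in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract(markup: str, separator: str) -> str:
    """Join text nodes with `separator`, then collapse runs of whitespace."""
    parser = TextChunks()
    parser.feed(markup)
    parser.close()
    joined = separator.join(parser.chunks)
    return re.sub(r"\s+", " ", joined).strip()


fragmented = (
    "<table><tr>\n"
    "<td>ver</td><td>ify</td><td> </td>\n"
    "<td>your</td><td> </td><td>acc</td><td>ount</td>\n"
    "</tr></table>"
)

print(extract(fragmented, "|"))  # keyword is broken up by the separators
print(extract(fragmented, ""))   # verify your account
```

Joining with an empty string is only safe for scanning purposes; a display-oriented extractor would still want separators between cells.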
5. Homoglyph and fullwidth substitution
| Original | Substitution | Unicode |
|---|---|---|
| account | Cyrillic а | U+0430 |
| microsoft | Cyrillic о | U+043E |
| password | fullwidth ｐａｓｓｗｏｒｄ | U+FF50.. |
| 123 | fullwidth １２３ | U+FF11.. |
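The two substitutions in the table need different handling, which a short Python sketch can show. Unicode NFKC normalization folds fullwidth forms to ASCII, but it does not turn a Cyrillic letter into its Latin lookalike, so homoglyphs are better caught by flagging words that mix scripts. The helper names and the three-script list are assumptions for illustration; script membership is approximated from the Unicode character name.

```python
import unicodedata

SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def fold_fullwidth(text: str) -> str:
    """NFKC maps fullwidth forms to ASCII (along with other compatibility foldings)."""
    return unicodedata.normalize("NFKC", text)


def scripts_in(word: str) -> set:
    """Return which of the tracked scripts the letters of `word` belong to."""
    found = set()
    for ch in fold_fullwidth(word):
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            found.update(s for s in SCRIPTS if name.startswith(s))
    return found


def is_mixed_script(word: str) -> bool:
    return len(scripts_in(word)) > 1


print(fold_fullwidth("\uff50\uff41\uff53\uff53\uff57\uff4f\uff52\uff44"))  # password
print(fold_fullwidth("\uff11\uff12\uff13"))                                # 123
print(is_mixed_script("acc\u043eunt"))  # True: Latin letters plus Cyrillic U+043E
print(is_mixed_script("account"))       # False
```

Checking per word rather than per message avoids flagging legitimate multilingual emails in which whole words, not single letters, are in another script.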
6. Text-as-image (SVG / PNG)
The entire phishing body is rendered as an image or SVG <text>, leaving the email body essentially empty. This bypasses both keyword detection and most NLP models.
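An image-only body has a recognizable shape: at least one image-like element and almost no text of its own. The Python sketch below is a hypothetical heuristic, not a description of how Vigilyx handles this technique; the class name, the 40-character threshold, and the choice to ignore text inside inline SVG are all assumptions.

```python
from html.parser import HTMLParser


class ImageBodyStats(HTMLParser):
    """Count image-like elements and the text that lives outside inline SVG."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.text_chars = 0
        self._svg_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag == "svg":
            self.images += 1
            self._svg_depth += 1

    def handle_endtag(self, tag):
        if tag == "svg" and self._svg_depth:
            self._svg_depth -= 1

    def handle_data(self, data):
        if self._svg_depth == 0:
            # Count non-whitespace characters only.
            self.text_chars += len("".join(data.split()))


def looks_image_only(markup: str, min_text_chars: int = 40) -> bool:
    # min_text_chars is an arbitrary illustrative threshold, not a tuned value.
    stats = ImageBodyStats()
    stats.feed(markup)
    stats.close()
    return stats.images > 0 and stats.text_chars < min_text_chars


print(looks_image_only('<html><body><img src="cid:notice.png"></body></html>'))  # True
print(looks_image_only("<p>hi</p>"))                                             # False
```

A flag like this only marks the message for further handling, such as OCR or a stricter link policy; it cannot say on its own whether the image content is malicious.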
Sample (sanitized)
<!DOCTYPE html>
<html><body>
<p>Dear Customer,</p>
<p>We detected unus<span style="display:none">XQZP</span>ual
activity on your acc<span style="font-size:0">ZZZ</span>ount.</p>
<p>Please <a href="https://bad.example/login">
ver​ify your iden​tity
</a> within 24 hours, or your acco​unt will be sus​pended.</p>
<p style="color:#fff;font-size:1px">
legitimate transaction notification harmless content
</p>
</body></html>

This sample stacks four techniques: HTML entities, display:none junk, zero-width characters, and 1px white text (used to poison Bayesian classifiers).
Animated walkthrough
HTML text cloaking: what the eye sees vs. what the parser sees. The parser's view comes first, with each zero-width space written out as U+200B:
<!DOCTYPE html>
<html><body style="font-family:sans-serif">
<p>Dear Customer,</p>
<p>We detected unusual activity on your
accU+200BoU+200BuU+200BnU+200Bt.</p>
<p>Please <a href="https://bad.example/login">
verify your
idenU+200Btity</a>
within 24 hours, or your
accU+200Bount will be
susU+200Bpended.</p>
<span style="display:none">RANDOMJUNK</span>
<p style="color:#fff;font-size:1px">legitimate transaction notification harmless content</p>
</body></html>

What the eye sees:

Dear Customer,
We detected unusual activity on your account.
Please verify your identity within 24 hours, or your account will be suspended.
— HR Compliance Team
Vigilyx detection coverage
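The table below maps each text-level technique to the Vigilyx defense that handles it, the source file that implements it, and the test that verifies it. This hardening was built over the last year and is backed by real source files and unit-test coverage, not a marketing promise. Text-as-image (technique 6) has no row in the table.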
| Bypass technique | Vigilyx defense | Source file | Verification |
|---|---|---|---|
| Zero-width insertion | normalize_text() strips 14 invisible characters (U+200B/200C/200D/200E/200F/FEFF/00AD/2060/2061/2062/2063/2064/180E/034F) | content_scan/mod.rs::normalize_text, data_security/dlp/normalize.rs::normalize_for_dlp | test_normalize_all_invisible_chars, test_evasion_zero_width_in_* |
| HTML entity encoding | decode_html_entities() after strip_html_tags() | content_scan/html_utils.rs, prompt_injection_scan.rs | test_evasion_html_entity_phone, test_evasion_html_entity_credential |
| CSS-hidden text | html_scan counts display:none / visibility:hidden instances | html_scan.rs lines 489-514 | counts_zero_width_cloaking |
| 1px white text | html_scan detects color:#fff plus font-size:0/1px over long text | html_scan.rs | same |
| Heavy zero-width usage | count_zero_width_chars() triggers zero_width_cloaking and Medium | html_scan.rs lines 295-298, 576-595 | counts_zero_width_cloaking |
| Table fragmentation | strip_html_tags() then normalize_text() collapses whitespace and rejoins keywords | content_scan/html_utils.rs, content_scan/mod.rs | content_scan integration tests |
| Homoglyph / mixed scripts | header_scan and link_content flag mixed Latin / Cyrillic / Greek | header_scan.rs, link_content.rs | mixed-script unit tests |
| Fullwidth chars | normalize_for_dlp() folds fullwidth to ASCII | data_security/dlp/normalize.rs | test_evasion_combo_fullwidth_plus_zero_width |
| Traditional/simplified Chinese variants | content_scan keyword set covers both forms | content_scan/detectors.rs keyword tables | content_scan keyword regression |
Dual-defense principle
Vigilyx applies a dual-defense strategy to obfuscation:
- Positive match: after every cloaking trick is normalized away, content_scan matches the original phishing keyword
- Reverse incrimination: the act of obfuscation (lots of zero-width characters, display:none blocks, mixed scripts) is itself malicious evidence; html_scan raises Medium even when no keyword matches
This means both paths fail for the attacker:
- No obfuscation → keywords match
- Obfuscation → the obfuscation itself triggers detection
Any cloaking trick still hits at least one path.
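The two paths can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Vigilyx's code: the keyword list, the threshold of three zero-width characters, and the `css_hidden_text` label are hypothetical, and tag stripping is a deliberately naive regex so that hidden junk survives normalization and the second path has to catch it.

```python
import html
import re
import unicodedata

INVISIBLE = "\u200b\u200c\u200d\ufeff\u00ad\u2060"
STRIP = {ord(ch): None for ch in INVISIBLE}
HIDDEN_CSS = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)
KEYWORDS = ("verify your account", "verify your identity")  # hypothetical keyword list


def positive_match(body: str) -> bool:
    """Path 1: undo the cloaking, then look for the plain keyword."""
    text = html.unescape(re.sub(r"<[^>]*>", "", body))
    text = unicodedata.normalize("NFKC", text).translate(STRIP)
    text = re.sub(r"\s+", " ", text).lower()
    return any(k in text for k in KEYWORDS)


def obfuscation_signals(body: str, zero_width_threshold: int = 3) -> list:
    """Path 2: treat the cloaking itself as evidence."""
    decoded = html.unescape(body)
    signals = []
    if sum(decoded.count(ch) for ch in INVISIBLE) >= zero_width_threshold:
        signals.append("zero_width_cloaking")
    if HIDDEN_CSS.search(body):
        signals.append("css_hidden_text")
    return signals


def flagged(body: str) -> bool:
    return positive_match(body) or bool(obfuscation_signals(body))


plain = "<p>Please verify your account today.</p>"
zero_width = "<p>Please ver\u200bify your acc\u200bount to\u200bday.</p>"
hidden_junk = '<p>Please ver<span style="display:none">XQ</span>ify your account.</p>'

print(positive_match(plain), obfuscation_signals(plain))
# True []
print(positive_match(zero_width), obfuscation_signals(zero_width))
# True ['zero_width_cloaking']
print(positive_match(hidden_junk), obfuscation_signals(hidden_junk))
# False ['css_hidden_text']
```

The third sample is the interesting one: the junk inside the hidden span defeats the keyword path, yet the message is still flagged because the hiding itself is counted.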
Real-world performance
In Q1 2026 production replays, emails combining zero-width characters, HTML entities, and display:none were caught by Vigilyx on both paths: the obfuscation signals themselves and the post-normalization keyword hits. DS-Murphy fusion consistently reached High, with no false positives or false negatives in the test set.
Defense
- Add zero_width_cloaking to the convergence-eligible module set so the convergence circuit breaker triggers earlier
- For repeat offenders, add the sender domain to local IOC (source=admin_malicious) for direct critical verdicts
- Audit historical hits via security_verdicts where categories @> '["zero_width_cloaking"]' to retro-check user clicks
End-user training
- Select the email body, copy to a plain editor — strange characters or broken spacing reveals zero-width insertion
- "Your account" written as "your accоunt" with a Cyrillic о is a homoglyph attack
- Treat any "verify within 24 hours" message as phishing by default
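For trainers or analysts who want to demonstrate the first two checks, a small Python helper can make the hidden characters visible in pasted text. The `reveal` function is a hypothetical illustration that relies only on the standard `unicodedata` module.

```python
import unicodedata


def reveal(text: str) -> str:
    """Replace every non-ASCII or non-printable character with a visible <U+XXXX NAME> tag."""
    out = []
    for ch in text:
        if ch.isascii() and (ch.isprintable() or ch in "\n\t"):
            out.append(ch)
        else:
            name = unicodedata.name(ch, "UNNAMED")
            out.append(f"<U+{ord(ch):04X} {name}>")
    return "".join(out)


print(reveal("your acc\u043eunt"))
# your acc<U+043E CYRILLIC SMALL LETTER O>unt
print(reveal("ver\u200bify"))
# ver<U+200B ZERO WIDTH SPACE>ify
```

Because it tags every non-ASCII character, the helper will also mark legitimate accented or non-Latin text; it is meant for inspecting a suspicious English-language message, not for automated filtering.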
