Case 04 — HTML Text Cloaking and Homoglyph Evasion

Sources: Microsoft Defender for Office 365 (2024), Cofense Intelligence 2024 to 2025, Sophos X-Ops Q1 2025, Trustwave SpiderLabs.

Background

Many email gateways rely on keyword and regex matching to identify phishing. Attackers respond by visually preserving the keyword while textually fragmenting it. The browser still renders normal phishing language, but substring or regex detection sees a scrambled byte stream that never matches.

The six bypass techniques below cover most of what was observed in the wild from 2024 through 2025. This case study shows how Vigilyx handles them.

The six bypass techniques

1. Zero-width character insertion

Insert invisible Unicode between every letter:

  • U+200B ZERO WIDTH SPACE
  • U+200C ZERO WIDTH NON-JOINER
  • U+200D ZERO WIDTH JOINER
  • U+FEFF ZERO WIDTH NO-BREAK SPACE
  • U+00AD SOFT HYPHEN
  • U+2060 WORD JOINER

The string still renders as "verify your account" to the human eye, but the underlying character sequence is v\u200Be\u200Br\u200Bi\u200Bf\u200By.... A naive /verify your account/ regex never matches.
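A minimal Python sketch of this bypass and its countermeasure, assuming a single hard-coded keyword pattern. The helper names `cloak` and `strip_invisible` are hypothetical and are not part of Vigilyx; the character set is the six code points listed above.

```python
import re

# The six invisible code points listed above.
INVISIBLE = "\u200b\u200c\u200d\ufeff\u00ad\u2060"
STRIP_TABLE = {ord(ch): None for ch in INVISIBLE}

PATTERN = re.compile(r"verify your account")


def cloak(text: str, filler: str = "\u200b") -> str:
    """Insert an invisible character between the letters of each word."""
    return " ".join(filler.join(word) for word in text.split(" "))


def strip_invisible(text: str) -> str:
    """Delete every invisible code point so the keyword is contiguous again."""
    return text.translate(STRIP_TABLE)


cloaked = cloak("verify your account")

print(PATTERN.search(cloaked))                   # no match on the cloaked text
print(PATTERN.search(strip_invisible(cloaked)))  # matches after stripping
```

Stripping must happen before matching. A pattern that tries to tolerate invisible characters between every letter quickly becomes unreadable and slow.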

2. HTML entity encoding

html
<!-- Renders as "verify your account" but the literal text contains none of those letters -->
&#118;&#101;&#114;&#105;&#102;&#121; &#121;&#111;&#117;&#114; &#97;&#99;&#99;&#111;&#117;&#110;&#116;

Variants: hexadecimal &#x76;&#x65;..., named entities, mixed encoding.
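As an illustration, Python's standard `html.unescape` decodes decimal, hexadecimal, and named entities in one pass. This is a sketch of the general idea, not Vigilyx's decoder; the `mixed` string is a made-up example of the mixed-encoding variant.

```python
import html
import re

PATTERN = re.compile(r"verify\s+your\s+account")

# Decimal entities, exactly as in the sample above.
encoded = (
    "&#118;&#101;&#114;&#105;&#102;&#121; "
    "&#121;&#111;&#117;&#114; "
    "&#97;&#99;&#99;&#111;&#117;&#110;&#116;"
)
# Hypothetical mixed variant: hex entities, a named entity, and plain letters.
mixed = "&#x76;&#x65;rify&nbsp;your &#97;ccount"

decoded = html.unescape(encoded)
decoded_mixed = html.unescape(mixed)

print(PATTERN.search(encoded))        # None: the raw text holds no keyword
print(decoded)                        # verify your account
print(PATTERN.search(decoded_mixed))  # matches; \s also covers the decoded U+00A0
```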

3. CSS-hidden text (display:none, visibility:hidden, font-size:0)

html
<p>Dear cust<span style="display:none">RANDOMJUNK</span>omer,
   please ver<span style="font-size:0">XYZ</span>ify your acc<span
   style="visibility:hidden">QQQ</span>ount immediately.</p>

The reader sees the full phishing prose. After strip_tags, keyword scanning sees junk characters interspersed with real letters and fails.
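One way to counter this is to extract only the text a reader would see. The sketch below uses Python's standard `html.parser` and assumes inline `style` attributes and properly closed tags; it ignores stylesheets and classes. `VisibleTextExtractor` and `visible_text` are hypothetical names, not Vigilyx functions.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0(?![.\d])",
    re.IGNORECASE,
)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collects text outside hidden elements and counts hidden elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []        # one flag per open element: is it hidden?
        self.parts = []
        self.hidden_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        hidden = bool(HIDDEN_STYLE.search(style))
        if hidden:
            self.hidden_count += 1
        parent_hidden = bool(self.stack) and self.stack[-1]
        self.stack.append(hidden or parent_hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.parts.append(data)


def visible_text(markup):
    parser = VisibleTextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split()), parser.hidden_count


SAMPLE = """<p>Dear cust<span style="display:none">RANDOMJUNK</span>omer,
   please ver<span style="font-size:0">XYZ</span>ify your acc<span
   style="visibility:hidden">QQQ</span>ount immediately.</p>"""

naive = re.sub(r"<[^>]+>", "", SAMPLE)   # plain tag stripping keeps the junk
text, hidden = visible_text(SAMPLE)
print(text, hidden)
```

The hidden-element count is useful on its own, since legitimate mail rarely hides text in the middle of a word.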

4. Table or cell fragmentation

html
<table><tr>
  <td>ver</td><td>ify</td><td>&nbsp;</td>
  <td>your</td><td>&nbsp;</td><td>acc</td><td>ount</td>
</tr></table>

The cells render as one continuous line, but an HTML extractor that puts a separator between cells yields ver|ify| |your| |acc|ount, which no longer contains the keyword.
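A small sketch of the rejoining idea, assuming simple un-nested `<td>` cells. The regex-based `cells` and `rejoin` helpers are hypothetical and far less robust than a real HTML parser.

```python
import html
import re

TABLE = """<table><tr>
  <td>ver</td><td>ify</td><td>&nbsp;</td>
  <td>your</td><td>&nbsp;</td><td>acc</td><td>ount</td>
</tr></table>"""


def cells(markup):
    """Return the decoded text of each <td> cell, in document order."""
    found = re.findall(r"<td[^>]*>(.*?)</td>", markup, re.DOTALL)
    return [html.unescape(cell) for cell in found]


def rejoin(markup):
    """Concatenate cells with no separator, then collapse whitespace."""
    joined = "".join(cells(markup))
    return " ".join(joined.split())   # split() also treats U+00A0 as whitespace


print("|".join(cells(TABLE)))   # what a separator-inserting extractor sees
print(rejoin(TABLE))            # verify your account
```

Joining without a separator can also glue together cells that really are separate words, so a scanner may want to match against both the joined and the separated form.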

5. Homoglyph and fullwidth substitution

| Original | Substitution | Unicode |
|---|---|---|
| account | Cyrillic а | U+0430 |
| microsoft | Cyrillic о | U+043E |
| password | fullwidth ｐａｓｓｗｏｒｄ | U+FF50.. |
| 123 | fullwidth １２３ | U+FF11.. |
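The two substitutions need different handling, which the following sketch illustrates with Python's standard `unicodedata` module. Fullwidth forms fold to ASCII under NFKC normalization, while Cyrillic look-alikes do not, so they are flagged by checking for mixed scripts inside one word. The helpers `fold_fullwidth`, `scripts`, and `is_mixed_script` are hypothetical, and the script test relies on Unicode character names, which is a simplification.

```python
import unicodedata


def fold_fullwidth(text):
    """NFKC maps fullwidth letters and digits to their ASCII counterparts."""
    return unicodedata.normalize("NFKC", text)


def scripts(word):
    """Return which of Latin / Cyrillic / Greek appear among the letters."""
    found = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for script in ("LATIN", "CYRILLIC", "GREEK"):
                if name.startswith(script):
                    found.add(script)
    return found


def is_mixed_script(word):
    return len(scripts(word)) > 1


fullwidth = "\uff50\uff41\uff53\uff53\uff57\uff4f\uff52\uff44\uff11\uff12\uff13"
homoglyph = "acc\u043eunt"   # Cyrillic o (U+043E) in place of Latin o

print(fold_fullwidth(fullwidth))             # password123
print(fold_fullwidth(homoglyph) == "account")  # False: NFKC leaves Cyrillic alone
print(is_mixed_script(homoglyph))            # True
```

Run the script check after folding, because fullwidth letters carry names that start with FULLWIDTH rather than LATIN.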

6. Text-as-image (SVG / PNG)

The entire phishing body is rendered as an image or SVG <text>, leaving the email body essentially empty. This bypasses both keyword detection and most NLP models.

Sample (sanitized)

html
<!DOCTYPE html>
<html><body>
  <p>Dear&#32;Customer,</p>
  <p>We dete&#x63;ted unus<span style="display:none">XQZP</span>ual
     activity on your acc<span style="font-size:0">ZZZ</span>ount.</p>
  <p>Please <a href="https://bad.example/login">
    ver&#x200B;ify your iden&#x200B;tity
  </a> within 24 hours, or your acco&#x200B;unt will be sus&#x200B;pended.</p>
  <p style="color:#fff;font-size:1px">
    legitimate transaction notification harmless content
  </p>
</body></html>

This sample stacks four techniques: HTML entities, CSS-hidden junk (display:none and font-size:0), zero-width characters, and 1px white text (used to poison Bayesian classifiers).
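Because the sample writes its zero-width characters as entities (`&#x200B;`), the order of normalization steps matters: entities have to be decoded before invisible characters are stripped. The sketch below runs one possible ordering over the sample. It is an assumption-laden illustration, not Vigilyx's pipeline: `HIDDEN_ELEMENT` is a regex that only handles un-nested elements with inline styles, and `normalize` and `wrong_order` are hypothetical names.

```python
import html
import re

SAMPLE = """<!DOCTYPE html>
<html><body>
  <p>Dear&#32;Customer,</p>
  <p>We dete&#x63;ted unus<span style="display:none">XQZP</span>ual
     activity on your acc<span style="font-size:0">ZZZ</span>ount.</p>
  <p>Please <a href="https://bad.example/login">
    ver&#x200B;ify your iden&#x200B;tity
  </a> within 24 hours, or your acco&#x200B;unt will be sus&#x200B;pended.</p>
  <p style="color:#fff;font-size:1px">
    legitimate transaction notification harmless content
  </p>
</body></html>"""

INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff\u00ad\u2060"))
HIDDEN_ELEMENT = re.compile(
    r'<(\w+)[^>]*style="[^"]*'
    r'(?:display\s*:\s*none|visibility\s*:\s*hidden'
    r'|font-size\s*:\s*[01](?:px)?(?![\w.]))'
    r'[^"]*"[^>]*>.*?</\1>',
    re.IGNORECASE | re.DOTALL,
)
TAG = re.compile(r"<[^>]+>")


def normalize(markup):
    text = HIDDEN_ELEMENT.sub("", markup)   # 1. drop hidden elements and their junk
    text = TAG.sub("", text)                # 2. strip the remaining tags
    text = html.unescape(text)              # 3. decode entities (surfaces U+200B)
    text = text.translate(INVISIBLE)        # 4. strip invisible characters
    return " ".join(text.split())           # 5. collapse whitespace


def wrong_order(markup):
    """Same steps, but invisible characters are stripped before decoding."""
    text = TAG.sub("", HIDDEN_ELEMENT.sub("", markup))
    text = html.unescape(text.translate(INVISIBLE))
    return " ".join(text.split())


print(normalize(SAMPLE))
print("\u200b" in wrong_order(SAMPLE))   # True: the cloaking survives
```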

Animated walkthrough

Live demo: HTML text cloaking, what the eye sees vs. what the parser sees

The demo steps through six views:

  1. Raw email
  2. Zero-width chars
  3. HTML entities
  4. CSS hidden
  5. Traditional gateway
  6. Vigilyx normalises

Source (zero-width characters shown as U+200B):

<!DOCTYPE html>
<html><body style="font-family:sans-serif">
  <p>Dear Customer,</p>
  <p>We detected unusual activity on your
    accU+200BoU+200BuU+200BnU+200Bt.</p>
  <p>Please <a href="https://bad.example/login">
    &#118;&#101;&#114;&#105;&#102;&#121; your
    idenU+200Btity</a>
  within 24 hours, or your
    accU+200Bount will be
    susU+200Bpended.</p>
  <span style="display:none">RANDOMJUNK</span>
  <p style="color:#fff;font-size:1px">legitimate transaction notification harmless content</p>
</body></html>

Browser render (mail.example.com / message #2483):

Dear Customer,

We detected unusual activity on your account.

Please verify your identity within 24 hours, or your account will be suspended.

— HR Compliance Team

Vigilyx detection coverage

Vigilyx counters the bypass techniques above with hardening built over the last year. Each defense in the table below is tied to real source files and unit-test coverage.

| Bypass technique | Vigilyx defense | Source file | Verification |
|---|---|---|---|
| Zero-width insertion | normalize_text() strips 14 invisible characters (U+200B/200C/200D/200E/200F/FEFF/00AD/2060/2061/2062/2063/2064/180E/034F) | content_scan/mod.rs::normalize_text, data_security/dlp/normalize.rs::normalize_for_dlp | test_normalize_all_invisible_chars, test_evasion_zero_width_in_* |
| HTML entity encoding | decode_html_entities() after strip_html_tags() | content_scan/html_utils.rs, prompt_injection_scan.rs | test_evasion_html_entity_phone, test_evasion_html_entity_credential |
| CSS-hidden text | html_scan counts display:none / visibility:hidden instances | html_scan.rs lines 489-514 | counts_zero_width_cloaking |
| 1px white text | html_scan detects color:#fff plus font-size:0/1px over long text | html_scan.rs | same as above |
| Heavy zero-width usage | count_zero_width_chars() triggers zero_width_cloaking and Medium | html_scan.rs lines 295-298, 576-595 | counts_zero_width_cloaking |
| Table fragmentation | strip_html_tags() then normalize_text() collapses whitespace and rejoins keywords | content_scan/html_utils.rs, content_scan/mod.rs | content_scan integration tests |
| Homoglyph / mixed scripts | header_scan and link_content flag mixed Latin / Cyrillic / Greek | header_scan.rs, link_content.rs | mixed-script unit tests |
| Fullwidth chars | normalize_for_dlp() folds fullwidth to ASCII | data_security/dlp/normalize.rs | test_evasion_combo_fullwidth_plus_zero_width |
| Traditional/simplified Chinese variants | content_scan keyword set covers both forms | content_scan/detectors.rs keyword tables | content_scan keyword regression |

Dual-defense principle

Vigilyx applies a dual-defense strategy to obfuscation:

  1. Positive match — after every cloaking trick is normalized away, content_scan matches the original phishing keyword
  2. Reverse incrimination — the act of obfuscation (lots of zero-width characters, display:none blocks, mixed scripts) is itself malicious evidence; html_scan raises Medium even when no keyword matches

This means both paths fail for the attacker:

  • No obfuscation → keywords match
  • Obfuscation → the obfuscation itself triggers detection

Any cloaking trick still hits at least one path.
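The two paths can be sketched in a few lines of Python. This is an illustration of the principle only: the keyword pattern, the threshold of three zero-width characters, and the signal names `keyword_match` and `hidden_css` are assumptions made for the example (only `zero_width_cloaking` is a name taken from this page), and the normalization is deliberately simplified.

```python
import html
import re
import unicodedata

ZERO_WIDTH = "\u200b\u200c\u200d\ufeff\u00ad\u2060"
KEYWORDS = re.compile(r"verify your (account|identity)", re.IGNORECASE)
HIDDEN_CSS = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)
ZERO_WIDTH_THRESHOLD = 3   # hypothetical cut-off for "heavy" usage


def normalize(raw):
    """Strip tags, decode entities, drop invisibles, fold fullwidth forms."""
    text = html.unescape(re.sub(r"<[^>]+>", "", raw))
    text = text.translate(dict.fromkeys(map(ord, ZERO_WIDTH)))
    return " ".join(unicodedata.normalize("NFKC", text).split())


def evaluate(raw):
    signals = set()
    # Path 1, positive match: the keyword is visible after normalization.
    if KEYWORDS.search(normalize(raw)):
        signals.add("keyword_match")
    # Path 2, reverse incrimination: the obfuscation itself is evidence.
    decoded = html.unescape(raw)
    if sum(decoded.count(ch) for ch in ZERO_WIDTH) >= ZERO_WIDTH_THRESHOLD:
        signals.add("zero_width_cloaking")
    if HIDDEN_CSS.search(raw):
        signals.add("hidden_css")
    return signals


plain = "<p>Please verify your account today.</p>"
cloaked = "<p>Please v\u200be\u200br\u200bify your account today.</p>"
# Zero-width characters plus Cyrillic o: the keyword never normalizes to ASCII.
stacked = "<p>ver\u200bi\u200bf\u200by y\u043eur acc\u043eunt</p>"
benign = "<p>Lunch is at noon.</p>"

for sample in (plain, cloaked, stacked, benign):
    print(sorted(evaluate(sample)))
```

The `stacked` sample evades the keyword path through the homoglyph but is still flagged by the second path, which is the behavior the dual-defense principle describes.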

Real-world performance

In Q1 2026 production replays, Vigilyx caught emails combining zero-width characters, HTML entities, and display:none through both the obfuscation signals and the post-normalization keyword hits. DS-Murphy fusion consistently reached High, with no false positives or false negatives in the test set.

Defense

  • Add zero_width_cloaking to the convergence-eligible module set so the convergence circuit breaker triggers earlier
  • For repeat offenders, add the sender domain to local IOC (source=admin_malicious) for direct critical verdicts
  • Audit historical hits via security_verdicts where categories @> '["zero_width_cloaking"]' to retro-check user clicks

End-user training

  • Select the email body and paste it into a plain-text editor; strange characters or broken spacing reveal zero-width insertion
  • "Your account" written as "your accоunt" with a Cyrillic о is a homoglyph attack
  • Treat any "verify within 24 hours" message as phishing by default

Released under AGPL-3.0-only.