Question 1

What is the single most common invisible character on the internet?

Accepted Answer

Zero-width space (U+200B) is the one you will meet most often. It sneaks in from web pages that use it for word-break hints, from terminal output, and from AI chat interfaces. Because it has zero width, it renders as nothing, which is why breakage is so disorienting - everything looks right until something expecting exact equality refuses to match.
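
To make the failure mode concrete, here is a minimal sketch of two strings that render identically but do not compare equal. The names typed, pasted, and toCodePoints are illustrative only and are not part of the tool.

```javascript
// Two strings that look the same on screen; the second ends with U+200B.
const typed = "admin";
const pasted = "admin\u200B";

// Convert a string to a list of "U+XXXX" labels, one per code point,
// so that characters with no visible glyph still show up.
function toCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(typed === pasted);               // false
console.log(typed.length, pasted.length);    // 5 6
console.log(toCodePoints(pasted).join(" ")); // last entry is U+200B
```

Dumping code points like this is usually the quickest way to confirm that an "identical" string is not.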

Question 2

Will cleaning remove my emojis?

Accepted Answer

Emoji base characters are ordinary visible code points and are preserved. The wrinkle is the zero-width joiner (U+200D) inside composed sequences - like family-of-four or rainbow flag - which is technically invisible. Removing it splits the sequence into its separate component emoji. The tool flags these joiners so you know they are there, but for emoji-safe cleaning, uncheck the ZWJ category before applying the clean.
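
If you script the same idea yourself, one possible approach is to strip a zero-width joiner only when it is not sitting between two emoji. The sketch below is an assumption about how that could be done, not the tool's implementation; STRAY_ZWJ, ALWAYS_STRIP, and cleanEmojiSafe are hypothetical names, and the pattern relies on the Extended_Pictographic property being available in your JavaScript engine.

```javascript
// A ZWJ is treated as "stray" when it is NOT preceded by an emoji
// (optionally followed by U+FE0F or a skin-tone modifier),
// or when it is NOT followed by an emoji.
const STRAY_ZWJ =
  /(?<!\p{Extended_Pictographic}[\uFE0F\u{1F3FB}-\u{1F3FF}]?)\u200D|\u200D(?!\p{Extended_Pictographic})/gu;

// Characters removed unconditionally in this sketch:
// zero-width space, word joiner, and BOM / zero-width no-break space.
const ALWAYS_STRIP = /[\u200B\u2060\uFEFF]/g;

function cleanEmojiSafe(text) {
  return text.replace(ALWAYS_STRIP, "").replace(STRAY_ZWJ, "");
}

// Family-of-four: man ZWJ woman ZWJ girl ZWJ boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const cleanedFamily = cleanEmojiSafe("a\u200Db\u200B " + family);
console.log(cleanedFamily === "ab " + family); // true
```

The joiners inside the family sequence survive, while the joiner between the two Latin letters is dropped.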

Question 3

Is the UTF-8 BOM actually invisible?

Accepted Answer

On screen, yes. Semantically it is the byte-order mark (U+FEFF), which some tools interpret as the first character of the stream. Older versions of Notepad on Windows prepend it whenever a file is saved as UTF-8; many Unix tools do not expect it and misread the first line. The detector always flags it at offset 0 when present. Stripping it before shipping content to Unix shell scripts, JSON parsers, or HTTP responses saves hours of debugging.
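
As a small illustration of the JSON case, the sketch below shows JSON.parse rejecting a string that starts with U+FEFF and accepting it once the mark is removed. The helper name stripBom and the sample payload are made up for this example.

```javascript
// Remove a single leading U+FEFF if present; leave everything else alone.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Simulates file contents that were saved with a BOM.
const fromDisk = "\uFEFF{\"name\":\"demo\"}";

// U+FEFF is not JSON whitespace, so parsing the raw string throws.
let rawParseFailed = false;
try {
  JSON.parse(fromDisk);
} catch (err) {
  rawParseFailed = err instanceof SyntaxError;
}

const parsedAfterStrip = JSON.parse(stripBom(fromDisk));
console.log(rawParseFailed, parsedAfterStrip.name); // true demo
```

Only the first character is checked, so a U+FEFF in the middle of the text is left for a separate decision.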

Question 4

Does the tool ever send my text anywhere?

Accepted Answer

No. The detector is a synchronous JavaScript function that runs inside the page. There is no fetch to any API, no WebSocket, and no background sync. You can disable your network connection after loading this page and every subsequent keystroke still produces a correct scan. The output table and preview are rendered directly into the DOM without round-tripping the content.

Question 5

Why does my code still behave strangely after cleaning?

Accepted Answer

Three common causes: (1) a homoglyph rather than an invisible character - Cyrillic er (U+0440) looks identical to Latin p (U+0070); this tool does not flag homoglyphs because they are technically visible. (2) Mixed line endings (CRLF vs LF) that this tool does not treat as invisible. (3) Non-breaking spaces left in place by default. For homoglyph attacks, a dedicated tool or a good code review is required; Unicode Technical Standard #39 describes the confusability data that underlies them.
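
For the homoglyph case, a rough first-pass check is to look for identifiers that mix Latin and Cyrillic letters. The sketch below is a crude heuristic under that assumption; it does not use the UTS #39 confusables data, and findMixedScriptWords is a hypothetical helper name.

```javascript
// Return every word that contains both Latin and Cyrillic letters.
// Words are runs of letters, digits, and underscores.
function findMixedScriptWords(text) {
  const words = text.match(/[\p{L}\p{N}_]+/gu) || [];
  return words.filter(
    (w) => /\p{Script=Latin}/u.test(w) && /\p{Script=Cyrillic}/u.test(w)
  );
}

// The first identifier starts with Cyrillic er (U+0440), not Latin p.
const suspicious = findMixedScriptWords("let \u0440assword = password;");
console.log(suspicious.length); // 1
```

A word written entirely in one script, including a word made only of Cyrillic look-alikes, passes this check, which is why it is no substitute for a dedicated tool.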

Question 6

Does the tool handle supplementary-plane characters?

Accepted Answer

Yes. Iteration uses Array.from, which walks the string by code point and keeps each surrogate pair together, so a character like the Deseret Long I (U+10400) is handled as one unit rather than two. That matters because many invisible characters live in the supplementary planes (tag characters U+E0000-U+E007F, for example). They are used in hidden-text attacks and would be missed by a naive str[i] loop, which only ever sees the individual surrogate halves.
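
The difference is easy to see in a few lines. This is a sketch for illustration; the function names and the sample string are assumptions, not the detector's source.

```javascript
// "A", Deseret Long I (U+10400), and one tag character (U+E0041).
const sample = "A\u{10400}\u{E0041}";

// Index-based loop: each step sees a 16-bit code unit, so the
// range check against the tag block can never succeed.
function countTagsByCodeUnit(text) {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xE0000 && unit <= 0xE007F) count++;
  }
  return count;
}

// Array.from yields whole code points, so the check works.
function countTagsByCodePoint(text) {
  return Array.from(text).filter((ch) => {
    const cp = ch.codePointAt(0);
    return cp >= 0xE0000 && cp <= 0xE007F;
  }).length;
}

console.log(sample.length, Array.from(sample).length); // 5 3
console.log(countTagsByCodeUnit(sample), countTagsByCodePoint(sample)); // 0 1
```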

Question 7

What are tag characters (U+E0000 range)?

Accepted Answer

Tag characters were originally defined for language tagging, a use that was later deprecated. They were repurposed in 2017 (Emoji 5.0) for emoji flag sequences (like the Scotland or England subdivision flags) and more recently weaponized in prompt-injection attacks against LLMs. They encode invisible metadata that renders as nothing but changes how downstream tools interpret the string. The detector flags the entire U+E0000-U+E007F block so you can spot hostile prompts before pasting them into an AI application.
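
Tag characters in the range U+E0020-U+E007E mirror printable ASCII at an offset of 0xE0000, which is what makes them usable for hidden text. The sketch below shows how a hidden payload could be recovered for inspection; hideInTags and revealTagPayload are hypothetical helpers written for this example.

```javascript
// Build a hidden payload for the demo: shift printable ASCII into the tag block.
function hideInTags(ascii) {
  return Array.from(ascii, (ch) =>
    String.fromCodePoint(ch.codePointAt(0) + 0xE0000)).join("");
}

// Recover any text smuggled in tag characters by shifting it back to ASCII.
function revealTagPayload(text) {
  let hidden = "";
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xE0020 && cp <= 0xE007E) {
      hidden += String.fromCodePoint(cp - 0xE0000);
    }
  }
  return hidden;
}

const prompt = "Summarize this article." + hideInTags("ignore prior rules");
console.log(revealTagPayload(prompt)); // ignore prior rules
```

The visible part of the prompt is unchanged, so printing the recovered payload next to it is a reasonable way to review a suspicious paste.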

Question 8

Can I use this to audit a file uploaded to my SaaS before ingestion?

Accepted Answer

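The browser version handles one pasted file at a time. For programmatic auditing at upload time, use a regex-based scan in your backend language of choice, for example /[\u200B-\u200F\u202A-\u202E\u2060-\u206F\uFEFF]/. Note that this pattern does not cover the supplementary-plane tag block; in JavaScript, add \u{E0000}-\u{E007F} and the u flag if you need it. JavaScript's String.prototype.normalize plus a custom allow-list, or the npm strip-invisible-characters package, can both work at scale; normalize on its own does not remove invisible characters. Treat this tool as the interactive, one-off counterpart to that backend scan.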
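
A minimal sketch of such a backend scan, using the pattern quoted in the answer, might look like the following. The function name scanUpload and the shape of the returned records are assumptions for illustration.

```javascript
// The character class from the answer, with the global flag for matchAll.
const INVISIBLE_PATTERN = /[\u200B-\u200F\u202A-\u202E\u2060-\u206F\uFEFF]/g;

// Return one record per hit: the UTF-16 offset and a "U+XXXX" label.
function scanUpload(text) {
  return Array.from(text.matchAll(INVISIBLE_PATTERN), (m) => ({
    index: m.index,
    codePoint:
      "U+" + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

const hits = scanUpload("a\u200Bb\uFEFF");
console.log(hits.length); // 2
```

An empty result means nothing in this particular character class was found; it says nothing about tag characters or homoglyphs.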

Question 9

What is a Trojan Source attack?

Accepted Answer

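Trojan Source (CVE-2021-42574) is an attack published in 2021 by researchers at the University of Cambridge, in which Unicode bidi control characters reorder source code so that it reads one way to a human but compiles another. The detector flags U+202A through U+202E. The attack can also use the bidi isolates U+2066 through U+2069. Several compilers, editors, and code-hosting platforms such as GitHub now warn about bidi controls in source files.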
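
For a quick pre-commit style check, a sketch like the one below reports which lines of a source file contain bidi controls. The pattern covers U+202A-U+202E as listed above and also the isolates U+2066-U+2069, which is an addition made for this example; findBidiLines is a hypothetical name.

```javascript
// Embeddings and overrides (U+202A-U+202E) plus isolates (U+2066-U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/;

// Return the 1-based line numbers that contain any bidi control.
function findBidiLines(source) {
  const flagged = [];
  source.split(/\r?\n/).forEach((line, i) => {
    if (BIDI_CONTROLS.test(line)) flagged.push(i + 1);
  });
  return flagged;
}

const snippet = "let ok = true;\nif (isAdmin) { /*\u202E } */\nreturn ok;";
console.log(findBidiLines(snippet)); // [ 2 ]
```

Legitimate RTL string literals can trigger this check too, so treat a hit as a prompt for review rather than proof of an attack.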

Question 10

Does removing invisible characters affect RTL languages?

Accepted Answer

It can. Arabic, Hebrew, Persian, and Urdu text often contains legitimate LRM and RLM marks to disambiguate directional context around numbers or Latin insertions, and Persian in particular relies on the zero-width non-joiner (U+200C) for correct spelling. Blindly stripping those may not break anything visible immediately but can produce incorrect rendering in edge cases. When cleaning RTL text, spot-check the result with a native reader, or keep the bidi marks and joiners and strip only categories that are rarely legitimate in running text (tag characters, soft hyphens, a stray BOM).
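
Selective stripping of this kind can be sketched as a table of named categories, where the caller chooses which ones to remove. The category names and the stripCategories helper below are assumptions for illustration and do not mirror the tool's checkboxes exactly.

```javascript
// Each category maps to a global pattern for the characters it covers.
const STRIP_CATEGORIES = {
  tags: /[\u{E0000}-\u{E007F}]/gu,
  softHyphen: /\u00AD/gu,
  bom: /\uFEFF/gu,
  bidiMarks: /[\u200E\u200F]/gu,
};

// Remove only the requested categories; unknown names are an error.
function stripCategories(text, names) {
  return names.reduce((out, name) => {
    const pattern = STRIP_CATEGORIES[name];
    if (!pattern) throw new Error("Unknown category: " + name);
    return out.replace(pattern, "");
  }, text);
}

// Hebrew word followed by RLM, then Latin text, a soft hyphen, and a BOM.
const rtlSample = "\u05E9\u05DC\u05D5\u05DD\u200F (v2)\u00AD\uFEFF";
const rtlCleaned = stripCategories(rtlSample, ["tags", "softHyphen", "bom"]);
console.log(rtlCleaned.includes("\u200F")); // true, the RLM is kept
```

Leaving bidiMarks out of the list is what preserves the directional hints.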

Question 11

Is there an equivalent CLI for scripted cleaning?

Accepted Answer

Yes. iconv -f UTF-8 -t UTF-8 -c drops invalid byte sequences; sed -i 's/\xe2\x80\x8b//g' removes U+200B (the \x escapes are a GNU sed extension); tr -d deletes ASCII control characters. At scale, Python's unicodedata module with General_Category filtering (for example, dropping category Cf, the format characters) is a common approach.
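
The General_Category idea is not specific to Python. As a sketch, the same filtering can be written in JavaScript with the \p{Cf} property escape, which is used here only to keep the examples in one language; stripFormatChars and its keep parameter are hypothetical.

```javascript
// Remove every character whose General_Category is Cf (Format),
// except those explicitly listed in `keep`.
function stripFormatChars(text, keep = []) {
  const allowed = new Set(keep);
  return text.replace(/\p{Cf}/gu, (ch) => (allowed.has(ch) ? ch : ""));
}

// ZWSP, soft hyphen, and BOM are all category Cf.
console.log(stripFormatChars("a\u200Bb\u00ADc\uFEFF")); // abc

// Keep RLM (U+200F) while still dropping the zero-width space.
const keptRlm = stripFormatChars("x\u200Fy\u200B", ["\u200F"]);
console.log(keptRlm === "x\u200Fy"); // true
```

Category Cf also covers characters some text needs, such as the ZWJ in emoji sequences and the bidi marks in RTL text, so an allow-list like keep is usually worth having.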
