Invisible Character Detector
Find and remove hidden zero-width and invisible characters from text.
Reviewed by Aygul Dovletova · Last reviewed
How to Use the Invisible Character Detector
- Paste your suspicious text into the large input area. The scan runs automatically on every keystroke, so you do not need to click a button.
- Read the summary card at the top. It lists how many invisible characters were found, broken down by Unicode category (zero-width, bidi control, spaces, BOM, separators).
- Open the details table to see each occurrence with its code point, Unicode name, and byte offset into the input, so you can jump to the exact spot in your editor.
- Scroll the highlighted preview below the table. Every hidden character is replaced with a colored tag such as [U+200B] so you can visually locate the intrusion.
- Click Clean Text to strip every detected character in place, or Copy Cleaned Text if you prefer to keep the original in the textarea untouched.
- Paste the cleaned result back into your editor, or wherever the malformed text originated, then save or commit as usual.
Under the Hood
The detector walks the input with Array.from(str) so it iterates over Unicode code points rather than UTF-16 code units, then checks each code point against a curated set of categories: Cf (format control), Cc (control), Zs (space separators other than U+0020), Zl (line separator), Zp (paragraph separator), and a specific allow-list of zero-width joiners, variation selectors, and bidi marks. Decisions are driven by the code point itself, not the character's visual rendering, so glyphs missing in your system font do not cause false positives. The highlighting pass replaces each detected character with a <span> tag using DOM text nodes rather than innerHTML, which means the preview cannot accidentally inject executable markup even for pathological inputs.
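The walk described above can be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual source: the names CATEGORIES and scan, the shape of the result objects, and the choice to skip tab, CR, and LF are all hypothetical.

```javascript
// Hypothetical category table: General_Category short name -> property-escape test.
const CATEGORIES = [
  ["Cf", /^\p{Cf}$/u],
  ["Cc", /^\p{Cc}$/u],
  ["Zl", /^\p{Zl}$/u],
  ["Zp", /^\p{Zp}$/u],
  ["Zs", /^\p{Zs}$/u],
];

const encoder = new TextEncoder();

function scan(text) {
  const hits = [];
  let byteOffset = 0; // UTF-8 byte offset of the current code point
  // Array.from iterates by code point, so surrogate pairs stay together.
  for (const ch of Array.from(text)) {
    const match = CATEGORIES.find(([, re]) => re.test(ch));
    // Assumption: ordinary space, tab, and line breaks count as visible whitespace.
    const visible = ch === " " || ch === "\t" || ch === "\n" || ch === "\r";
    if (match && !visible) {
      hits.push({ codePoint: ch.codePointAt(0), category: match[0], byteOffset });
    }
    byteOffset += encoder.encode(ch).length;
  }
  return hits;
}

console.log(scan("a\u200Bb\uFEFF"));
```

The decision depends only on the code point's category, which is why a missing glyph in the system font has no effect on the result.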
Why You Would Run Text Through This
- A copy-pasted shell command keeps failing with "command not found" because a zero-width space hides between git and status.
- Your programming language's string comparison returns false for two values that look identical on screen.
- You pasted a password from a password manager and authentication fails because a bidirectional mark rode along.
- A JSON file fails to parse with a cryptic "Unexpected token" error at column 1 because of a UTF-8 BOM.
- You received a Word document export that contains soft hyphens (U+00AD) that break grep and line-wrap mid-word.
- You are auditing an AI-generated response for Trojan Source-style hidden control characters or invisible tag characters before shipping it to production.
Common Pitfalls and Edge Cases
- ZWJ emoji sequences. The family-of-four emoji joins four people with zero-width joiners (U+200D). Those are intentional and removing them breaks the composed glyph. The detector flags them but you should keep them.
- Variation selectors. U+FE0F changes a base character into its emoji presentation (like the red heart). Stripping it can turn colorful emoji back into plain text glyphs.
- Arabic and Hebrew bidi marks. U+200E (LRM) and U+200F (RLM) can be legitimate in right-to-left text. Blindly removing them can break sentence ordering.
- The UTF-8 BOM (U+FEFF). Harmless in most text editors but will break shebang parsing (#!/bin/sh), JSON parsers, and HTTP Content-Type sniffing when it appears at the start of a file.
- Non-breaking space (U+00A0). Visually identical to a normal space but is not recognized by \s in some regex flavors and fails to split with str.split(' ').
- Tabs and regular spaces are not flagged, because they are visible whitespace. Use the Whitespace Remover tool if you need to strip them.
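Two of the pitfalls above are easy to reproduce in a few lines. The sketch below is illustrative only; stripKeepingEmoji is a hypothetical helper, not a function the tool exposes. It relies on the fact that U+FE0F is a nonspacing mark (Mn), so a Cf-only filter never touches it, and it explicitly spares the zero-width joiner.

```javascript
// Non-breaking space: looks like a space, but split(" ") does not see it.
const nbspText = "hello\u00A0world";
const naiveParts = nbspText.split(" ");
const fixedParts = nbspText.replace(/\u00A0/g, " ").split(" ");
console.log(naiveParts.length, fixedParts.length); // 1 2

// Hypothetical emoji-safe cleaner: remove format characters (Cf) except ZWJ (U+200D).
function stripKeepingEmoji(text) {
  return text.replace(/(?!\u200D)\p{Cf}/gu, "");
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(stripKeepingEmoji(family) === family); // true
console.log(stripKeepingEmoji("a\u200Bb"));        // ab
```

The trade-off is that a zero-width joiner planted between ordinary letters also survives, so this variant suits text where emoji matter more than strict cleaning.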
Unicode Category Background
Every Unicode code point is assigned a General_Category property that groups it into classes such as Letter, Number, Punctuation, Symbol, Mark, Separator, or Other. The "invisible" class is not a single category - it is an informal union drawn from several: Cf (format characters like zero-width joiners, bidi marks, and the Mongolian vowel separator U+180E, which was reclassified from Zs to Cf in Unicode 6.3), Cc (control codes like NUL and BEL, the terminal bell), Zl and Zp (the line and paragraph separators U+2028 and U+2029), and many of the Zs space characters beyond U+0020 (en space, em space, hair space). Unicode Technical Standard #39 (Unicode Security Mechanisms) and the Trojan Source research paper (CVE-2021-42574) document how these characters can be weaponized to sneak malicious code past human review. Unicode Standard Annex #31, covering identifier syntax, recommends restricting programming-language identifiers to a conservative subset to prevent such attacks.
Comparison to Alternatives
On Linux you can spot invisible characters with cat -A, which converts non-printables to caret notation, or with hexdump -C for byte-level certainty. Editors such as VS Code, Sublime Text, and JetBrains IDEs have optional rendering of whitespace and bidi characters that reveals them inline. Some linter plugins (for example eslint-plugin-no-bidi) flag Trojan Source-style bidi characters at commit time. The npm package strip-invisible-characters or the Python unicodedata module lets you script the cleanup at build time. Use this web detector when you have a snippet you received via chat or email and want an instant visual breakdown without opening a terminal or installing plugins - especially useful on a Chromebook, a phone, or a locked-down work laptop.
Frequently Asked Questions
What is the single most common invisible character on the internet?
Zero-width space (U+200B) wins by a wide margin. It sneaks in from web pages that use it for word-break hints, from terminal output, and from AI chat interfaces. Because it is zero pixels wide it looks like nothing, which is why breakage is so disorienting - everything looks right until something expecting exact equality refuses to match.
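A minimal reproduction of that failure, with made-up sample strings, looks like this:

```javascript
const typed = "git status";
const pasted = "git\u200B status"; // looks the same on screen

console.log(typed === pasted);            // false
console.log(typed.length, pasted.length); // 10 11

// Removing the zero-width space restores equality.
const cleaned = pasted.replace(/\u200B/g, "");
console.log(cleaned === typed);           // true
```

Comparing lengths is often the quickest first check when two strings look identical but refuse to match.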
Will cleaning remove my emojis?
Emoji base characters are ordinary visible symbol code points and are preserved. The wrinkle is zero-width joiners inside composed sequences - like family-of-four or rainbow flag - which are technically invisible. The tool flags them so you know, but for emoji-safe cleaning uncheck the ZWJ categories before you apply the clean.
Is the UTF-8 BOM actually invisible?
On screen yes, but semantically it is a byte-order mark (U+FEFF) that some tools interpret as the first character of the stream. Older versions of Notepad on Windows prepend it whenever a file is saved as UTF-8; many Unix tools treat it as unexpected leading bytes. The detector always flags it at offset 0 when present. Stripping it before shipping content to Unix shell scripts, JSON parsers, or HTTP responses saves hours of debugging.
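The JSON case can be reproduced and fixed in a few lines. This is a sketch with a hypothetical helper name (stripBom) and a made-up payload:

```javascript
// Hypothetical helper: drop a single leading U+FEFF if present.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

const raw = "\uFEFF{\"ok\": true}";

let parseFailed = false;
try {
  JSON.parse(raw); // the BOM is not valid JSON whitespace
} catch (err) {
  parseFailed = err instanceof SyntaxError;
}

const parsed = JSON.parse(stripBom(raw));
console.log(parseFailed, parsed.ok); // true true
```

Only the first code unit is checked, so a U+FEFF that appears mid-string (where it acts as a zero-width no-break space) is left for a separate cleaning pass.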
Does the tool ever send my text anywhere?
No. The detector is a synchronous JavaScript function that runs inside the page. There is no fetch to any API, no WebSocket, and no background sync. You can disable your network connection after loading this page and every subsequent keystroke still produces a correct scan. The output table and preview are rendered directly into the DOM without round-tripping the content.
Why does my code still behave strangely after cleaning?
Three common causes: (1) a homoglyph rather than an invisible character - Cyrillic er (U+0440) looks identical to Latin p (U+0070); this tool does not flag homoglyphs because they are technically visible. (2) Mixed line endings (CRLF vs LF) that this tool does not treat as invisible. (3) Non-breaking spaces left in place by default. For homoglyph attacks, a dedicated tool or a good code review is required; Unicode Technical Standard #39 describes the confusability data that underlies them.
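For cause (1), a rough first check is to look for tokens that mix scripts. The sketch below is an assumption-laden illustration rather than a confusables implementation: mixesLatinAndCyrillic is a hypothetical helper, it covers only two scripts, and it does not use the UTS #39 confusability data.

```javascript
// Hypothetical check: does one token contain both Latin and Cyrillic letters?
function mixesLatinAndCyrillic(word) {
  return /\p{Script=Latin}/u.test(word) && /\p{Script=Cyrillic}/u.test(word);
}

const spoofed = "\u0440aypal"; // Cyrillic er followed by Latin "aypal"
console.log(mixesLatinAndCyrillic(spoofed));  // true
console.log(mixesLatinAndCyrillic("paypal")); // false
```

A word written entirely in one look-alike script passes this check, which is why the confusability tables remain necessary for a thorough audit.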
Does the tool handle supplementary-plane characters?
Yes. Iteration uses Array.from, which keeps each surrogate pair together as a single code point, so a character like the Deseret Long I (U+10400) is handled as one unit rather than two. That matters because many invisible characters in the supplementary planes (tag characters U+E0000-U+E007F, for example) are used in hidden-text attacks and would be missed by a naive str[i] loop that only ever sees the two surrogate halves.
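The difference between code units and code points is easy to see with a two-character sample string (chosen here purely for illustration):

```javascript
const sample = "\u{10400}\u{E0041}"; // Deseret Long I + TAG LATIN CAPITAL LETTER A

// Code-point iteration: two entries.
const points = Array.from(sample).map((ch) => ch.codePointAt(0).toString(16));

// Naive index loop: four UTF-16 code units, none of which is the real code point.
const units = [];
for (let i = 0; i < sample.length; i++) {
  units.push(sample.charCodeAt(i).toString(16));
}

console.log(sample.length); // 4
console.log(points);        // [ '10400', 'e0041' ]
console.log(units);         // [ 'd801', 'dc00', 'db40', 'dc41' ]
```

A range check such as "is this value between 0xE0000 and 0xE007F" can never succeed against the values in units, which is how a code-unit loop ends up missing tag characters.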
What are tag characters (U+E0000 range)?
Tag characters were introduced for language tagging, later deprecated for that purpose, and then repurposed in Emoji 5.0 (2017) for emoji subdivision flag sequences (like the Scotland or England flags). More recently they have been weaponized in prompt-injection attacks against LLMs. They encode invisible metadata that renders as nothing but changes how downstream tools interpret the string. The detector flags the entire U+E0000-U+E007F block so you can spot hostile prompts before pasting them into an AI application.
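Because each tag character mirrors an ASCII character at a fixed offset of 0xE0000, hidden text can be recovered by subtraction. The following is a sketch; revealTags and hideAsTags are hypothetical helpers written for this example, and the payload is made up.

```javascript
// Hypothetical helper: recover ASCII text smuggled as tag characters (U+E0020-U+E007E).
function revealTags(text) {
  let hidden = "";
  for (const ch of text) { // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp >= 0xE0020 && cp <= 0xE007E) {
      hidden += String.fromCodePoint(cp - 0xE0000);
    }
  }
  return hidden;
}

// Hypothetical helper used only to build a test string.
function hideAsTags(ascii) {
  return Array.from(ascii, (ch) => String.fromCodePoint(ch.codePointAt(0) + 0xE0000)).join("");
}

const message = "Hello" + hideAsTags("ignore previous instructions");
console.log(revealTags(message)); // ignore previous instructions
```

Decoding the payload before deleting it is useful during an audit, since it shows what the hidden instruction was trying to do.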
Can I use this to audit a file uploaded to my SaaS before ingestion?
The browser version is one pasted file at a time. For programmatic auditing at upload time, use a regex-based scan in your backend language of choice, for example /[\u200B-\u200F\u202A-\u202E\u2060-\u206F\uFEFF]/. Node's String.prototype.normalize plus a custom allow-list or the npm strip-invisible-characters package both work well at scale. Treat this tool as the interactive, one-off counterpart to that backend scan.
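A backend scan built on the regex above might look like the sketch below. The function name auditUpload and the finding format are assumptions for illustration; the character class is the one quoted in the answer, with the g flag added so every match is reported.

```javascript
// Character class taken from the text above; the g flag is required by matchAll.
const SUSPICIOUS = /[\u200B-\u200F\u202A-\u202E\u2060-\u206F\uFEFF]/g;

// Hypothetical audit function: list each suspicious character and its UTF-16 index.
function auditUpload(text) {
  const findings = [];
  for (const m of text.matchAll(SUSPICIOUS)) {
    const hex = m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    findings.push({ index: m.index, codePoint: "U+" + hex });
  }
  return findings;
}

console.log(auditUpload("a\u200Bb\u202Ec"));
```

Note that this class does not cover soft hyphens, non-breaking spaces, or the supplementary-plane tag characters, so extend it if those matter for your uploads.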
What is a Trojan Source attack?
Trojan Source (CVE-2021-42574) is an attack disclosed in 2021 by researchers at the University of Cambridge, in which Unicode bidi control characters reorder source code so it reads one way to a human but executes another. The detector flags U+202A through U+202E. Compilers such as GCC and rustc, and code hosts such as GitHub, now warn about bidi control characters in source files.
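For a quick pre-review check, a line-oriented scan is often enough. The sketch below uses a hypothetical helper, findBidiLines, and as an assumption goes slightly beyond the range named above by also matching the bidi isolates U+2066 through U+2069, which the Trojan Source paper also describes.

```javascript
// Embedding/override controls (U+202A-U+202E) plus isolates (U+2066-U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/;

// Hypothetical helper: return the 1-based numbers of lines containing bidi controls.
function findBidiLines(source) {
  return source
    .split(/\r?\n/)
    .flatMap((line, i) => (BIDI_CONTROLS.test(line) ? [i + 1] : []));
}

const snippet = "let a = 1;\nlet b = '\u202E';\nlet c;\n";
console.log(findBidiLines(snippet)); // [ 2 ]
```

LRM and RLM (U+200E, U+200F) are deliberately left out of this pattern because they cannot reorder surrounding code the way embeddings, overrides, and isolates can.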
Does removing invisible characters affect RTL languages?
It can. Arabic, Hebrew, Persian, and Urdu text often contains legitimate LRM and RLM marks to disambiguate directional context around numbers or Latin insertions. Blindly stripping those may not break anything visible immediately but can produce incorrect directional rendering in edge cases. When cleaning RTL text, spot-check the result with a native reader or keep the bidi marks and only strip categories that are obviously malicious (tag characters, soft hyphens, BOM).
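The selective approach suggested above can be sketched as a narrow pattern that removes only soft hyphens, the BOM, and tag characters while leaving LRM and RLM alone. The name cleanRtlSafe and the exact character set are assumptions for illustration.

```javascript
// Soft hyphen, BOM, and the tag-character block; LRM/RLM are intentionally absent.
const OBVIOUSLY_UNWANTED = /[\u00AD\uFEFF\u{E0000}-\u{E007F}]/gu;

// Hypothetical helper: strip only the categories listed above.
function cleanRtlSafe(text) {
  return text.replace(OBVIOUSLY_UNWANTED, "");
}

const hebrew = "\uFEFF\u05E9\u05DC\u05D5\u05DD\u200F 123\u00AD";
const cleanedRtl = cleanRtlSafe(hebrew);
console.log(cleanedRtl.includes("\u200F")); // true
console.log(cleanedRtl.length);             // 9
```

The u flag is needed so the \u{E0000}-\u{E007F} range is read as code points rather than as surrogate halves.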
Is there an equivalent CLI for scripted cleaning?
Yes. iconv -f UTF-8 -t UTF-8 -c strips invalid sequences; sed -i 's/\xe2\x80\x8b//g' removes U+200B; tr -d handles ASCII control characters. For scale, Python's unicodedata plus General_Category filtering is the standard approach.