HTML Entity Encode / Decode

Encode special characters to HTML entities or decode them back.

How to Use the HTML Entity Encoder/Decoder

  1. Pick Encode or Decode using the mode toggle. Encode turns characters like <, >, &, ", and apostrophe into HTML entities; Decode reverses the process, turning &lt; back into <.
  2. Paste or type text in the input pane. The output updates on every keystroke, so you see the transformation in real time without clicking anything.
  3. Toggle "Encode all non-ASCII" if you need every character above code point 127 (accented letters, emoji, CJK) turned into numeric entities. Leave it off to keep only the five XML special characters encoded.
  4. Use the Swap button to move the output back into the input and flip the mode. That lets you round-trip a string to confirm your encoding is lossless.

What the Codec Does and How

The encoder scans the input string character by character. For each character it checks against a replacement table. The default table covers the five XML predefined entities: & becomes &amp;, < becomes &lt;, > becomes &gt;, " becomes &quot;, and the apostrophe becomes &#39; (the named entity &apos; is valid XML but was historically missing in older HTML, so the numeric form is safer). With "Encode all non-ASCII" enabled, any character with a code point above 127 emits &#code; in decimal.
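The table-driven scan described above can be sketched in Python. This is a minimal illustration, not the tool's actual source: the names `DEFAULT_TABLE`, `encode_entities`, and `encode_non_ascii` are hypothetical, and the option simply mirrors the "Encode all non-ASCII" toggle.

```python
# Replacement table for the five characters named above.
# All names in this sketch are illustrative, not the tool's own.
DEFAULT_TABLE = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
}


def encode_entities(text: str, encode_non_ascii: bool = False) -> str:
    """Encode text one character at a time using the table above."""
    out = []
    for ch in text:  # Python iterates by code point, so emoji arrive whole
        if ch in DEFAULT_TABLE:
            out.append(DEFAULT_TABLE[ch])
        elif encode_non_ascii and ord(ch) > 127:
            out.append(f"&#{ord(ch)};")  # decimal numeric reference
        else:
            out.append(ch)
    return "".join(out)


print(encode_entities('<a title="Tom & Jerry\'s">'))
# &lt;a title=&quot;Tom &amp; Jerry&#39;s&quot;&gt;
print(encode_entities("café 😀", encode_non_ascii=True))
# caf&#233; &#128512;
```

Because every ampersand in the input is replaced exactly once during a single left-to-right pass, the output never contains an ampersand that was re-encoded by the same call.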

The decoder handles three entity forms: named entities from the HTML named character references list (&copy;, &euro;, &mdash;, and over 2,000 others), decimal numeric references (&#169;), and hexadecimal numeric references (&#xA9;). Any sequence that is not a recognised entity is left untouched. A numeric reference to a code point above U+FFFF decodes to a single character, which JavaScript stores internally as a surrogate pair, so &#128512; correctly decodes to the grinning face emoji (U+1F600).
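For comparison, Python's standard-library `html.unescape` accepts the same three forms. The snippet below is only a sketch of that behaviour using the standard library; it is not this tool's decoder, and the sample strings are made up for illustration.

```python
import html

# The same character written as a named, decimal, and hexadecimal reference.
three_forms = html.unescape("&copy; &#169; &#xA9;")
print(three_forms)  # © © ©

# A code point above U+FFFF decodes to one character (U+1F600).
emoji = html.unescape("&#128512;")
print(len(emoji))  # 1

# Text that is not a recognised reference passes through unchanged.
untouched = html.unescape("AT&T &zzz; & more")
print(untouched)  # AT&T &zzz; & more
```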

When You Need This

  • Pasting a code snippet containing < or > into an HTML page without breaking the surrounding markup.
  • Preparing a string for safe inclusion in an XML document (SOAP request, SVG, Atom feed) where the five reserved characters must be encoded.
  • Cleaning up text extracted from an older CMS where everything was stored as entities and you want plain Unicode back.
  • Debugging why a web page displays literal text such as &amp;amp; - usually a double-encoding bug you can unwind with repeated Decode passes.
  • Producing ASCII-only content for a system (legacy SMTP, old SMS gateways) that cannot reliably carry UTF-8.
  • Encoding a long paste that mixes many kinds of punctuation to see which characters need escaping in your target format.

Gotchas

  • The apostrophe. &apos; was added in HTML5 but was absent from HTML 4, so older tools may not recognise it. The encoder uses the numeric form &#39; for maximum compatibility across HTML 4 and XML 1.0.
  • Semicolons are required. Modern browsers forgive missing terminators on legacy entities (&copy without ;), but a strict XML parser will reject them. When in doubt, include the semicolon.
  • Numeric bounds. HTML numeric character references above 0x10FFFF are invalid. The decoder silently drops out-of-range values rather than attempting a best-effort conversion.
  • Surrogate pairs. UTF-16 surrogate halves (0xD800 to 0xDFFF) should not appear as standalone entities. If your input has them, the decoder outputs the replacement character U+FFFD to signal the problem.
  • Double-encoded input. &amp;lt; is a string where an ampersand was encoded on top of an already-encoded entity. One decode pass yields &lt;; a second pass yields <. Re-run decode until the output stops changing.
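Several of these edge cases are easy to probe with Python's standard-library `html.unescape`. Treat this as a comparison point, not a description of this tool: Python substitutes U+FFFD for out-of-range values as well as for lone surrogates, whereas the decoder described above drops out-of-range values.

```python
import html

# A legacy entity without its semicolon is tolerated, HTML-style.
no_semicolon = html.unescape("&copy 2024")
print(no_semicolon)  # © 2024

# A reference above 0x10FFFF: Python's choice is the replacement character.
out_of_range = html.unescape("&#x110000;")
print(out_of_range == "\ufffd")  # True

# A lone UTF-16 surrogate half also becomes U+FFFD.
lone_surrogate = html.unescape("&#xD83D;")
print(lone_surrogate == "\ufffd")  # True
```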

Spec Background

HTML character references are defined in the WHATWG's HTML Living Standard, specifically the "Named character references" and "Character reference state" sections of the tokenizer. The canonical list of named entities lives at html.spec.whatwg.org and contains just over 2,000 entries; some legacy names appear both with and without the trailing semicolon for compatibility. XML 1.0, section 4.6, defines only the five predefined entities (amp, lt, gt, quot, apos) and requires every other entity to be declared in a DTD. That difference is why a string that decodes cleanly in a browser may fail in an XML parser - the XML parser has no implicit knowledge of &copy;. Numeric character references use the same decimal and hexadecimal syntax in both standards.
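The HTML/XML difference can be observed with Python's `xml.etree.ElementTree`, which parses without a DTD by default. This is a small sketch; the helper name `xml_accepts` and the `<p>` wrapper element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def xml_accepts(fragment: str) -> bool:
    """Return True if the fragment parses as XML content with no DTD."""
    try:
        ET.fromstring(f"<p>{fragment}</p>")
        return True
    except ET.ParseError:
        return False


print(xml_accepts("&lt;&amp;&gt;&quot;&apos;"))  # True: the five predefined entities
print(xml_accepts("&#169; &#xA9;"))              # True: numeric references
print(xml_accepts("&copy;"))                     # False: undefined entity in XML
```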

Similar Tools

A package such as he on npm covers the same entity tables and is a reasonable substitute if you need this logic in a Node script. Python's html module in the standard library (html.escape and html.unescape) covers the same use cases. sed or awk can do a quick five-entity escape but cannot handle the full named-entity dictionary. In the Chrome DevTools Console you can create an element with document.createElement("textarea"), assign the string to its innerHTML, and read back its value to run the string through the browser's own entity decoder - fast for one-off decoding. This in-browser tool saves you the ceremony of any of those when you just need one quick round-trip.
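As a quick illustration of the Python route, `html.escape` handles the same five characters, with one visible difference: it writes the apostrophe as the hexadecimal reference `&#x27;` where this tool emits `&#39;`. Both refer to the same character. The sample string is made up for the example.

```python
import html

text = """<p title="it's">"""

with_quotes = html.escape(text)  # quote=True is the default
print(with_quotes)  # &lt;p title=&quot;it&#x27;s&quot;&gt;

without_quotes = html.escape(text, quote=False)  # only &, <, > are escaped
print(without_quotes)  # &lt;p title="it's"&gt;
```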

Frequently Asked Questions

Which characters are encoded by default?

The five HTML/XML special characters: <code>&amp;</code> becomes <code>&amp;amp;</code>, <code>&lt;</code> becomes <code>&amp;lt;</code>, <code>&gt;</code> becomes <code>&amp;gt;</code>, <code>&quot;</code> becomes <code>&amp;quot;</code>, and the apostrophe becomes <code>&amp;#39;</code>. That is the minimal set required to keep a string safe inside HTML text content or attribute values. Every other character passes through unchanged, keeping your emoji, accented letters, and CJK text readable in the source.

What does the "Encode all non-ASCII" option do?

With that toggle on, any character whose code point is above 127 gets replaced with a decimal numeric entity (<code>&amp;#</code>code<code>;</code>). The output becomes pure ASCII, which is useful for contexts that are unreliable with UTF-8 - legacy mail transports, some SMS gateways, or systems that default to ISO-8859-1. The downside is the output is much less readable; turn it off if your pipeline handles UTF-8 end to end.
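Python has a built-in codec error handler, `xmlcharrefreplace`, that produces the same kind of ASCII-only output. The sketch below shows only the non-ASCII step; it does not escape the five special characters, so it is not a full equivalent of this tool's toggle.

```python
import html

text = "café 😀"

# Characters that do not fit in ASCII become decimal numeric references.
ascii_only = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_only)  # caf&#233; &#128512;

# Decoding restores the original text, so the step is lossless.
print(html.unescape(ascii_only) == text)  # True
```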

Does the decoder handle named entities?

Yes. The decoder accepts every named character reference in the HTML Living Standard list, which is over 2,000 entries long. Common ones like <code>&amp;copy;</code>, <code>&amp;reg;</code>, <code>&amp;euro;</code>, <code>&amp;nbsp;</code>, and <code>&amp;mdash;</code> decode to their Unicode equivalents. It also tolerates a handful of legacy entities that worked without a trailing semicolon in old browsers, though the encoder always emits the semicolon-terminated form.
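Python ships a copy of the same named-reference list as `html.entities.html5`, which makes the semicolon rule easy to inspect. This is a sketch against the standard library's table, not this tool's internal one.

```python
from html.entities import html5

# The table holds both semicolon-terminated names and legacy bare names.
print(len(html5) > 2000)  # True

copy_sign = html5["copy;"]
euro_sign = html5["euro;"]
print(copy_sign, euro_sign)  # © €

# Legacy names such as "copy" are also listed without the semicolon...
print("copy" in html5)   # True
# ...while newer names such as "mdash" exist only with it.
print("mdash" in html5)  # False
```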

Is this safe to use on untrusted input?

Encoding the five special characters is the foundation of XSS prevention, and this tool implements that encoding correctly. However, safe HTML output requires more than entity encoding - you also need to avoid dangerous attributes (<code>javascript:</code> URLs), script contexts, and unsafe uses of user input in inline event handlers. If you are handling untrusted content, do encoding at the output boundary in your web framework rather than as a copy-paste step.
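A short Python sketch shows why entity encoding alone is not enough for URL attributes: a `javascript:` URL contains none of the five special characters, so it survives encoding unchanged. The `is_allowed_href` helper and its scheme allowlist are hypothetical and deliberately strict; this is an illustration of the point above, not a complete sanitizer.

```python
import html
from urllib.parse import urlsplit

# Assumption for this sketch: only absolute http(s) links are acceptable.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_href(url: str) -> bool:
    """Accept only URLs whose scheme is explicitly allowlisted."""
    return urlsplit(url).scheme.lower() in ALLOWED_SCHEMES


payload = "javascript:alert(1)"

# Entity encoding leaves the payload intact: nothing in it needs escaping.
print(html.escape(payload) == payload)  # True

# A separate scheme check is what rejects it.
print(is_allowed_href(payload))                 # False
print(is_allowed_href("https://example.com/"))  # True
```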

Is my text sent to a server?

No. The codec runs inside your browser tab as a Preact component and uses in-memory string operations only. There is no fetch call, no websocket, and no logging. People often test encoding on sensitive strings (API keys, internal URLs, personal data), so the local-only guarantee matters; you can verify it in the DevTools Network panel, which shows zero requests while you type.

How are Unicode code points above U+FFFF encoded?

In numeric form they appear as a single decimal or hex reference - for example the pile of poo emoji <code>&#128169;</code> is <code>&amp;#128169;</code> in decimal or <code>&amp;#x1F4A9;</code> in hex. JavaScript strings internally store these as UTF-16 surrogate pairs, but the encoder converts pairs to their original code point before emitting the entity. The decoder does the reverse, reassembling the surrogate pair on the way out.
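The surrogate-pair arithmetic mentioned above follows the standard UTF-16 formulas. The sketch below works on plain integers in Python; the function names are illustrative and are not taken from this tool.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = ord("💩")  # U+1F4A9
high, low = to_surrogate_pair(cp)
print(f"{high:X} {low:X}")              # D83D DCA9
print(f"&#{cp};", f"&#x{cp:X};")        # &#128169; &#x1F4A9;
print(from_surrogate_pair(high, low) == cp)  # True
```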

Can I use the output directly in an XML document?

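Yes. The default encoding uses only references that XML 1.0 accepts: four of the predefined entities from section 4.6 (<code>&amp;amp;</code>, <code>&amp;lt;</code>, <code>&amp;gt;</code>, <code>&amp;quot;</code>) plus the numeric reference <code>&amp;#39;</code> for the apostrophe. If you encoded with the non-ASCII option on, numeric entities are also valid XML. Avoid named entities beyond the five predefined ones - <code>&amp;copy;</code>, <code>&amp;nbsp;</code>, and the rest are HTML-specific and an XML parser without a DTD will reject them.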

Why use <code>&amp;#39;</code> instead of <code>&amp;apos;</code>?

Historical compatibility. The named entity <code>&amp;apos;</code> is valid XML 1.0 and HTML5 but was not defined in HTML 4.01; older Internet Explorer versions and some email clients display it literally instead of decoding. The numeric form <code>&amp;#39;</code> works everywhere that entities work, so the encoder uses it for the apostrophe by default.

What about double-encoded text?

Double-encoding happens when text is encoded twice by accident - <code>&amp;amp;lt;</code> for <code>&lt;</code>. One Decode pass yields <code>&amp;lt;</code>; a second pass yields <code>&lt;</code>. Run Decode repeatedly (the Swap button helps chain operations) until the output stops changing. The root cause is usually a web form that re-encodes data on submission; fix the pipeline rather than relying on manual decoding.
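The "decode until the output stops changing" loop can be sketched with Python's `html.unescape`. The function name and the `max_passes` safety cap are assumptions for this example. Note that the loop will also decode text that was meant to stay encoded once, so it suits diagnosis better than production pipelines.

```python
import html


def decode_until_stable(text: str, max_passes: int = 10) -> tuple[str, int]:
    """Decode repeatedly; return the result and how many passes changed it."""
    passes = 0
    while passes < max_passes:
        decoded = html.unescape(text)
        if decoded == text:  # fixed point reached: nothing left to decode
            break
        text = decoded
        passes += 1
    return text, passes


print(decode_until_stable("&amp;amp;lt;"))  # ('<', 3)
print(decode_until_stable("plain text"))    # ('plain text', 0)
```

The pass count is a useful clue when hunting the root cause: it indicates how many extra encoding layers the pipeline applied.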

How does HTML encoding differ from URL encoding?

They solve different problems. HTML encoding (this tool) makes text safe inside HTML element content or attribute values by replacing structural characters with entities. URL encoding (percent-encoding, RFC 3986) makes text safe inside a URL by replacing reserved characters with <code>%XX</code> sequences. A string inside a query parameter of an HTML link needs both - first URL-encoded to form a valid URL, then HTML-encoded so the <code>&amp;</code> separators do not break the HTML. Use the URL Encoder/Decoder tool for percent-encoding.
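The two-step order for a link can be sketched in Python with the standard library. The URL, parameter names, and values are made up for the example.

```python
import html
from urllib.parse import urlencode

# Step 1: URL-encode the parameter values to form a valid URL.
query = urlencode({"q": "a&b <c>", "lang": "en"})
url = "https://example.com/search?" + query
print(url)  # https://example.com/search?q=a%26b+%3Cc%3E&lang=en

# Step 2: HTML-encode the whole URL before placing it in the attribute.
link = f'<a href="{html.escape(url)}">search</a>'
print(link)
# <a href="https://example.com/search?q=a%26b+%3Cc%3E&amp;lang=en">search</a>
```

The ampersand inside the value is percent-encoded as `%26` in step 1, while the ampersand that separates parameters becomes `&amp;` in step 2.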
