String Length Calculator

Calculate string length in characters, bytes (UTF-8/UTF-16) and graphemes.

How to Use the String Length Calculator

  1. Paste or type your string into the input textarea. Single tokens, lines of code, or long paragraphs all work.
  2. Watch the metrics update in the cards above the output: .length, UTF-8 bytes, UTF-16 bytes, graphemes, code points, and lines. Every metric refreshes on each keystroke.
  3. Look for the divergence warning. When .length differs from the grapheme count, a yellow note appears - your string contains characters that will break naive length checks.
  4. Use the UTF-8 byte count when sizing database columns, HTTP headers, or file sizes. Use the UTF-16 byte count when thinking about in-memory strings in Java, C#, or JavaScript.
  5. Copy any specific metric by selecting the number directly from the card; the input textarea keeps your full string available for further tweaking.

What Each Metric Means Under the Hood

.length reads the JavaScript string property, which returns the number of UTF-16 code units, the 16-bit chunks in which the string is stored internally. UTF-8 bytes is computed by new TextEncoder().encode(str).length, which runs the WHATWG UTF-8 encoder and returns a Uint8Array whose length is the number of bytes the string would occupy in a UTF-8 file. UTF-16 bytes is simply str.length * 2, because every UTF-16 code unit is 2 bytes. Code points use the spread operator [...str].length, which iterates by Unicode code point and counts each surrogate pair as one. Grapheme count relies on Intl.Segmenter with granularity: "grapheme", the only built-in way to count user-perceived characters in modern JavaScript. Line count splits on /\r?\n/. Every metric is computed locally with no network call.
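
The six computations described above can be sketched as a single function. This is a minimal illustration, not the page's actual source: the name measure and the shape of the returned object are assumptions, and an empty string reports 1 line here because split() always returns at least one element.

```javascript
// Sketch of the six metrics. `measure` is a hypothetical name.
function measure(str) {
  const codeUnits = str.length; // UTF-16 code units
  const utf8Bytes = new TextEncoder().encode(str).length;

  // Count user-perceived characters (extended grapheme clusters).
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let graphemes = 0;
  for (const _segment of segmenter.segment(str)) graphemes += 1;

  return {
    length: codeUnits,
    utf8Bytes,
    utf16Bytes: codeUnits * 2, // every code unit is 2 bytes
    graphemes,
    codePoints: [...str].length, // string iteration walks code points
    lines: str.split(/\r?\n/).length,
  };
}

console.log(measure("a\u{1F600}")); // "a" followed by the grinning face emoji
```

For "a" followed by 😀 this yields length 3, 5 UTF-8 bytes, 6 UTF-16 bytes, 2 graphemes, 2 code points, and 1 line.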

When These Metrics Matter

  • Validating a user nickname against a 20-character limit where users assume "character" means grapheme, not UTF-16 code unit.
  • Sizing a database VARCHAR(255) column - PostgreSQL and MySQL (utf8mb4) count characters (code points), SQL Server VARCHAR counts bytes (as does Oracle VARCHAR2 by default), and SQL Server NVARCHAR counts UTF-16 code units.
  • Calculating exact SMS payload fees, where carriers charge by septet count for text in the GSM 7-bit alphabet and by UCS-2 code unit count for anything else.
  • Checking whether a tweet fits under 280 "characters" as Twitter defines them, which blends code points with weighted Latin vs CJK scoring.
  • Confirming that your API response body stays under a Content-Length limit expressed in bytes.
  • Spotting a bug where your form validator reports "too long" for a perfectly normal-looking name because the backend counts bytes and the UI counts characters.
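
The nickname and byte-budget cases above can be combined into one check. The sketch below is hypothetical: validateNickname, its option names, and the default limits (20 graphemes, 255 UTF-8 bytes) are illustrative assumptions, not values from any particular system.

```javascript
// Hypothetical validator: limits are illustrative defaults only.
function validateNickname(name, { maxGraphemes = 20, maxUtf8Bytes = 255 } = {}) {
  // Normalize first so composed and decomposed input measure the same.
  const normalized = name.normalize("NFC");

  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const graphemes = [...segmenter.segment(normalized)].length;
  const utf8Bytes = new TextEncoder().encode(normalized).length;

  const errors = [];
  if (graphemes > maxGraphemes) {
    errors.push(`too many characters: ${graphemes} > ${maxGraphemes}`);
  }
  if (utf8Bytes > maxUtf8Bytes) {
    errors.push(`too many bytes: ${utf8Bytes} > ${maxUtf8Bytes}`);
  }
  return { ok: errors.length === 0, graphemes, utf8Bytes, errors };
}

console.log(validateNickname("Jos\u00E9")); // ok: 4 graphemes, 5 bytes
```

Checking both units in the UI means the user sees a specific message before a byte-counting backend rejects the value.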

Common Pitfalls and Edge Cases

  • Single emoji spanning 4 bytes. The grinning face 😀 is one grapheme, one code point (U+1F600), two UTF-16 code units, and four UTF-8 bytes. Any validator that disagrees about which of those is "length" will surprise someone.
  • Family emoji with ZWJ. A family-of-four emoji is seven code points (four person emoji joined by three zero-width joiners) - one grapheme, but 11 UTF-16 code units and 25 UTF-8 bytes. The divergence here is dramatic and routinely breaks mobile input validators.
  • Hangul syllable blocks. Korean syllables can be precomposed (one code point) or decomposed into Jamo (multiple code points). The two forms render identically but have different lengths in every metric except grapheme count.
  • Combining accents. é as U+00E9 is 1 code point and 2 UTF-8 bytes; as e + combining acute (U+0065 U+0301) it is 2 code points and 3 UTF-8 bytes. Always normalize with str.normalize("NFC") before length comparison.
  • BOM and invisible characters. A leading U+FEFF byte-order mark adds 3 UTF-8 bytes that the user cannot see. Run suspicious inputs through the Invisible Character Detector on this site.
  • Normalization forms. NFC, NFD, NFKC, and NFKD produce different byte lengths for the same visual string; choose one consistently.
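
The combining-accent, Hangul, and BOM cases above can be reproduced in a few lines. This is a sketch under stated assumptions: canonicalize and utf8Length are hypothetical helper names, and stripping a single leading U+FEFF is one possible policy, not a universal rule.

```javascript
// Hypothetical helper: drop one leading BOM, then normalize to NFC.
function canonicalize(str) {
  const withoutBom = str.charCodeAt(0) === 0xfeff ? str.slice(1) : str;
  return withoutBom.normalize("NFC");
}

const utf8Length = (s) => new TextEncoder().encode(s).length;

const decomposedE = "e\u0301"; // e + combining acute accent
const decomposedHangul = "\u1112\u1161\u11AB"; // three Jamo that compose to U+D55C
const withBom = "\uFEFFhi"; // invisible leading byte-order mark

for (const s of [decomposedE, decomposedHangul, withBom]) {
  const clean = canonicalize(s);
  console.log(s.length, utf8Length(s), "->", clean.length, utf8Length(clean));
}
// 2 3 -> 1 2
// 3 9 -> 1 3
// 3 5 -> 2 2
```

Each row shows code units and UTF-8 bytes before and after canonicalizing; the grapheme count of the visible text is the same on both sides.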

Why There Are Six Different Answers

The Unicode standard defines abstract characters and assigns them code points from U+0000 to U+10FFFF. UTF-8 (RFC 3629) encodes each code point as 1 to 4 bytes; UTF-16 (RFC 2781) as 1 or 2 16-bit code units; UTF-32 as a single 32-bit unit. JavaScript strings are sequences of UTF-16 code units, so string.length counts code units, and a supplementary-plane emoji counts as 2. Grapheme clusters, defined by Unicode Standard Annex #29, group multiple code points into what the user perceives as one character - roughly the span the caret crosses when you press the right arrow key once. The difference matters because form validation, database limits, SMS billing, and API length caps are all defined in different units by different systems.
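
The encoding widths described above follow directly from the code point's numeric range. The sketch below derives UTF-8 and UTF-16 sizes without calling an encoder; the function names are hypothetical and chosen for illustration.

```javascript
// UTF-8 width by code point range (RFC 3629).
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7f) return 1; // ASCII
  if (codePoint <= 0x7ff) return 2; // e.g. Latin-1 Supplement, Greek, Cyrillic
  if (codePoint <= 0xffff) return 3; // rest of the Basic Multilingual Plane
  return 4; // supplementary planes
}

// UTF-16 width: one code unit in the BMP, a surrogate pair above it.
function utf16CodeUnits(codePoint) {
  return codePoint <= 0xffff ? 1 : 2;
}

function encodedSizes(str) {
  let utf8Bytes = 0;
  let utf16Units = 0;
  let codePoints = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    utf8Bytes += utf8ByteLength(cp);
    utf16Units += utf16CodeUnits(cp);
    codePoints += 1;
  }
  return { utf8Bytes, utf16Units, codePoints };
}

// "a", "é", "€", and the grinning face: 1 + 2 + 3 + 4 bytes.
console.log(encodedSizes("a\u00E9\u20AC\u{1F600}"));
// { utf8Bytes: 10, utf16Units: 5, codePoints: 4 }
```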

Comparison to Alternatives

In Node.js the same metrics are reachable with str.length, Buffer.byteLength(str, "utf8"), and [...str].length; for graphemes you still need Intl.Segmenter or the grapheme-splitter npm package. Python's len() on a str counts code points, len(s.encode("utf-8")) gives bytes, and the third-party grapheme package gives grapheme counts. Ruby's String#length counts code points for a UTF-8 string, not grapheme clusters; String#bytesize gives bytes, and grapheme counts come from grapheme_clusters.length or each_grapheme_cluster (built in since Ruby 2.5). Editor status bars such as those in VS Code and JetBrains IDEs typically report UTF-16 code units, so an editor's char count rarely means graphemes. Use this tool when you need all six numbers at once to debug a length-mismatch bug across a full stack.
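
As a rough illustration of the Node.js route, the snippet below gathers the non-grapheme metrics with built-ins only. It assumes a Node.js runtime, where Buffer is a global; the variable names are illustrative.

```javascript
// Assumes Node.js: Buffer is a Node global, not a browser API.
const sample = "na\u00EFve \u{1F600}"; // "naïve" + space + grinning face

const nodeMetrics = {
  codeUnits: sample.length,
  utf8Bytes: Buffer.byteLength(sample, "utf8"),
  utf16Bytes: Buffer.byteLength(sample, "utf16le"),
  codePoints: [...sample].length,
};

console.log(nodeMetrics);
// { codeUnits: 8, utf8Bytes: 11, utf16Bytes: 16, codePoints: 7 }
```

Buffer.byteLength avoids allocating the encoded bytes, which can matter when measuring large strings.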

Frequently Asked Questions

Which length should I use to validate a tweet?

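Twitter uses a custom weighted count, applied to the NFC-normalized text: code points in mostly Latin ranges count as 1, most others (including CJK) count as 2, and the total cap is 280. For Latin text the result is close to grapheme count, but it is not identical to any metric shown here. Use the twitter-text library for production validation; grapheme count is a reasonable UX approximation.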

Why does my database say the username is too long when it looks fine?

Three common causes. The column has a byte limit (for example SQL Server VARCHAR with a UTF-8 collation, or Oracle VARCHAR2 with byte semantics) and the user typed an emoji consuming 4 bytes; note that MySQL VARCHAR(n) with utf8mb4 counts characters, not bytes. Or the schema is NVARCHAR on SQL Server counting UTF-16 units, so a supplementary-plane character costs 2. Or the connection is using latin1 and the Unicode gets mangled on insert. Compare the UTF-8 byte, UTF-16 code unit, and code point counts against your column definition.

Does any of this compute on a server?

No. TextEncoder, Intl.Segmenter, string.length, and the spread operator are all native browser APIs that run inside the page. There is no fetch call, no worker, and no service worker intercepting your string. You can pop open DevTools, throttle the network to offline, and the calculator keeps working - your input never leaves the tab.

How does SMS character counting actually work?

For GSM 03.38 alphabet messages, the limit is 160 septets (7-bit characters) per single message, or 153 septets per part in a concatenated message. If the text contains any character outside that alphabet (any emoji, most non-Latin scripts), the whole message falls back to UCS-2 encoding with a limit of 70 16-bit code units per single message or 67 per part; an emoji above U+FFFF uses two of those units. Carriers bill per part, so a single emoji in an otherwise-ASCII message can triple your SMS cost. None of the metrics here directly model this; specialized SMS libraries do.
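
The part arithmetic above can be sketched as follows. This is a simplified illustration, not a production encoder: smsParts and both constants are hypothetical names, only the GSM 03.38 default alphabet and its common extension table are covered (no national shift tables), and the sketch does not model the rule that an escape sequence or surrogate pair cannot be split across parts.

```javascript
// GSM 03.38 default alphabet: one septet each.
const GSM_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
  "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà";
// Extension table: escape + character, so two septets each.
const GSM_EXTENDED = "^{}\\[~]|€\f";

function smsParts(text) {
  let septets = 0;
  let fitsGsm = true;
  for (const ch of text) {
    if (GSM_BASIC.includes(ch)) septets += 1;
    else if (GSM_EXTENDED.includes(ch)) septets += 2;
    else {
      fitsGsm = false; // one non-GSM character forces UCS-2 for the whole message
      break;
    }
  }
  if (fitsGsm) {
    const parts = septets <= 160 ? 1 : Math.ceil(septets / 153);
    return { encoding: "GSM-7", units: septets, parts };
  }
  const units = text.length; // 16-bit code units
  const parts = units <= 70 ? 1 : Math.ceil(units / 67);
  return { encoding: "UCS-2", units, parts };
}

console.log(smsParts("a".repeat(160))); // GSM-7, 1 part
console.log(smsParts("a".repeat(159) + "\u{1F600}")); // UCS-2, 161 units, 3 parts
```

The second call shows the tripling effect: replacing the last of 160 ASCII letters with one emoji moves the message from one part to three.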

When should I normalize before measuring?

Always, if you care about consistent comparison. Two visually identical strings can be encoded differently (composed vs decomposed accents, different Unicode normalization forms) and have different byte lengths. Call str.normalize("NFC") in JavaScript before measuring for the most stable results. NFD is the decomposed form (longer); NFKC is the compatibility-composed form (it collapses ligatures and fullwidth forms). Choose once and apply everywhere in your pipeline.

Why do some combining emoji show such high UTF-8 byte counts?

Because they are stitched together from multiple code points joined by zero-width joiners. The family-of-four emoji is technically four people emoji, each 4 bytes, separated by three ZWJ characters at 3 bytes each - 25 bytes for what renders as one image. Adding skin-tone modifiers or gender variants pushes it higher. If you want byte-fair treatment of such emoji, set a UTF-8 byte budget generous enough for them, or reject emoji in the validation if your use case does not need them.
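
The 25-byte arithmetic can be checked by listing each code point with its encoded size. In this sketch codePointBreakdown is a hypothetical helper name, and the man, woman, girl, boy sequence is just one example of a family-of-four emoji.

```javascript
// List every code point in a string with its UTF-8 and UTF-16 size.
function codePointBreakdown(str) {
  const encoder = new TextEncoder();
  return [...str].map((ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    utf8Bytes: encoder.encode(ch).length,
    utf16Units: ch.length,
  }));
}

// Man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const rows = codePointBreakdown(family);
const totalBytes = rows.reduce((sum, row) => sum + row.utf8Bytes, 0);

console.table(rows);
console.log(rows.length, totalBytes, family.length); // 7 25 11
```

Four emoji at 4 bytes plus three U+200D joiners at 3 bytes gives 25, and the same rows explain why .length reports 11.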

Does .length ever equal code-point count?

Only when your string contains no characters above U+FFFF. For pure ASCII, Cyrillic, Arabic, Hebrew, Greek, everyday CJK, and anything else in the Basic Multilingual Plane, .length equals code-point count. Supplementary-plane characters - most emoji, rare CJK extensions, ancient scripts like Gothic or Phoenician, and mathematical alphanumeric symbols - are encoded as surrogate pairs in UTF-16 and count as 2 in .length.
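
A quick way to test for this condition is a Unicode-aware regular expression over the supplementary range. Both function names below are hypothetical.

```javascript
// True if the string contains any code point above U+FFFF.
function hasSupplementaryCharacters(str) {
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

// Direct comparison of the two metrics.
function lengthMatchesCodePoints(str) {
  return str.length === [...str].length;
}

console.log(hasSupplementaryCharacters("\u041F\u0440\u0438\u0432\u0435\u0442")); // false (Cyrillic)
console.log(hasSupplementaryCharacters("\u{10330}")); // true (Gothic letter ahsa)
console.log(lengthMatchesCodePoints("x\u{1D400}")); // false: length 3, 2 code points
```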

How is this different from wc -c and wc -m?

wc -c counts bytes, which matches this tool's UTF-8 bytes metric when the file is UTF-8 encoded. wc -m with a UTF-8 locale counts characters as Unicode code points, matching the code-points metric here. Neither tool reports UTF-16 code units or graphemes. For interactive debugging across the full matrix, this web tool is faster than switching locales and re-running wc.

What is the practical difference between NFC and NFD?

NFC (Canonical Composition) prefers pre-composed characters: a single code point U+00E9 for é. NFD (Canonical Decomposition) prefers the base character plus combining marks: U+0065 U+0301. They render identically but NFD is longer in both bytes and code-point count. macOS historically used NFD in filenames (which caused surprising git diff noise when syncing repos across platforms), while most other systems use NFC. When in doubt, normalize to NFC.

How should I use this tool when sizing a DynamoDB attribute?

DynamoDB limits each item to 400 KB, measured in raw bytes; for strings that means UTF-8 bytes, and attribute names count toward the total as well as values. Use the UTF-8 Bytes metric to measure a sample, then provision for the max. A partition key value can be at most 2048 bytes and a sort key value at most 1024 bytes. DynamoDB docs define everything in bytes, not characters.

Can I trust Intl.Segmenter for grapheme counting?

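Yes, in current versions of all major browsers. Intl.Segmenter (ECMA-402) implements UAX #29 extended grapheme clusters. Chromium, Safari, and Node 16+ have shipped it for years; Firefox added it in version 125 (2024), so feature-detect it if you support older Firefox releases. Results can shift slightly between engine versions as their Unicode data is updated. The output matches cursor navigation through complex emoji - that is the gold standard of "one user character."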
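
Where Intl.Segmenter might be missing (an older browser or runtime), a feature check with a clearly labeled fallback is one option. The sketch below is an assumption about how such a guard could look; countGraphemes and the exact flag are hypothetical, and the code-point fallback overcounts combining sequences and ZWJ emoji.

```javascript
// Count graphemes when possible; otherwise fall back to code points and say so.
function countGraphemes(str, locale) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    let count = 0;
    for (const _segment of segmenter.segment(str)) count += 1;
    return { count, exact: true };
  }
  // Fallback is only an upper bound on the grapheme count.
  return { count: [...str].length, exact: false };
}

const result = countGraphemes("e\u0301"); // e + combining acute
console.log(result.exact ? "graphemes:" : "code points (approximate):", result.count);
```

Returning the exact flag lets the caller decide whether to show the number as a grapheme count or as an approximation.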
