Question 1

Which length should I use to validate a tweet?

Accepted Answer

None of them exactly. Twitter uses a custom weighted count: ASCII characters count as 1, most CJK characters count as 2, and the total is capped at 280. Of the metrics here, grapheme count comes closest, but it is not identical. Use the twitter-text library for production validation; grapheme count is a reasonable approximation for UX feedback.
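To make the weighting concrete, here is a rough sketch of a weighted counter. The code-point ranges and weights are written from memory of the twitter-text v3 configuration and should be treated as assumptions to verify against the library. The sketch also skips the library's special handling of URLs and emoji sequences, so it is not a substitute for twitter-text. The function names are hypothetical.

```javascript
// Assumed configuration (verify against twitter-text before relying on it).
// Code points inside these ranges weigh 100; everything else weighs 200.
const LIGHT_RANGES = [
  [0, 4351],
  [8192, 8205],
  [8208, 8223],
  [8242, 8247],
];
const SCALE = 100;
const LIGHT_WEIGHT = 100;
const DEFAULT_WEIGHT = 200;
const MAX_WEIGHTED_LENGTH = 280;

// Approximate weighted length: no URL or emoji-sequence handling.
function approxTweetLength(text) {
  let weighted = 0;
  // for...of iterates code points, not UTF-16 code units.
  for (const ch of text.normalize("NFC")) {
    const cp = ch.codePointAt(0);
    const light = LIGHT_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
    weighted += light ? LIGHT_WEIGHT : DEFAULT_WEIGHT;
  }
  return weighted / SCALE;
}

function fitsInTweet(text) {
  return approxTweetLength(text) <= MAX_WEIGHTED_LENGTH;
}

console.log(approxTweetLength("hello"));              // ASCII: 1 each
console.log(approxTweetLength("\u65E5\u672C\u8A9E")); // CJK: 2 each
```

Under these assumptions, 280 ASCII characters fit but only 140 CJK characters do, which matches the weights described above.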

Question 2

Why does my database say the username is too long when it looks fine?

Accepted Answer

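Three common causes. First, the column limit is counted in bytes rather than characters (Oracle VARCHAR2 with byte semantics, for example; note that MySQL VARCHAR(n) on utf8mb4 counts characters), and the user typed an emoji that takes 4 bytes in UTF-8. Second, the schema is NVARCHAR on SQL Server, which counts UTF-16 code units, so a supplementary-plane character costs 2. Third, the connection is using latin1 and the Unicode text gets mangled on insert, often growing in the process. Compare the UTF-8 byte, UTF-16 code unit, and code-point counts against your column definition.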
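A minimal sketch of that comparison, assuming you know which unit your column counts in. The helper names and the `unit` parameter are hypothetical, not part of any database driver.

```javascript
// Measure a value in the three units a database column might count in.
function measureForDatabase(value) {
  return {
    utf8Bytes: new TextEncoder().encode(value).length, // byte-limited columns
    utf16Units: value.length,                          // UTF-16 based columns
    codePoints: [...value].length,                     // character-counted columns
  };
}

// Hypothetical helper: `unit` is one of the keys returned above.
function fitsColumn(value, limit, unit) {
  return measureForDatabase(value)[unit] <= limit;
}

// "josé" followed by a party-popper emoji (U+1F389).
const username = "jos\u00E9\u{1F389}";
console.log(measureForDatabase(username));
```

The same five visible characters measure 5 code points, 6 UTF-16 units, and 9 UTF-8 bytes, so a limit of 5 passes or fails depending on the unit.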

Question 3

Does any of this compute on a server?

Accepted Answer

No. TextEncoder, Intl.Segmenter, the string .length property, and the spread operator are all built into the browser and run inside the page. There is no fetch call, no worker, and no service worker intercepting your string. You can open DevTools, set network throttling to offline, and the calculator keeps working: your input never leaves the tab.

Question 4

How does SMS character counting actually work?

Accepted Answer

For GSM 03.38 alphabet messages, the limit is 160 septets (7-bit characters) per single message, or 153 septets per part in a concatenated message. A few GSM characters, such as € and the square brackets, come from the extension table and cost 2 septets each. If the message contains any non-GSM character (any emoji, most non-Latin scripts), the whole message falls back to UCS-2 encoding, with a limit of 70 16-bit units per single message or 67 per part. Carriers bill per part, so a single emoji added to a 160-character ASCII message turns one part into three and triples your SMS cost. None of the metrics here directly model this; specialized SMS libraries do.
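The rules above can be sketched as a small part counter. This is an illustration under simplifying assumptions, not a replacement for an SMS library: the character tables are transcribed from the GSM 03.38 default alphabet and its extension table, national shift tables are ignored, and the sketch does not account for a 2-septet character or a surrogate pair that would straddle a part boundary. The function name is hypothetical.

```javascript
// GSM 03.38 default alphabet: each character costs 1 septet.
const GSM_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
  "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà";
// Extension table: escape + character, so 2 septets each.
const GSM_EXTENDED = "\f^{}\\[~]|€";

function smsParts(text) {
  let septets = 0;
  let isGsm = true;
  for (const ch of text) {
    if (GSM_BASIC.includes(ch)) septets += 1;
    else if (GSM_EXTENDED.includes(ch)) septets += 2;
    else {
      isGsm = false; // one non-GSM character switches the whole message
      break;
    }
  }
  if (isGsm) {
    const parts = septets <= 160 ? 1 : Math.ceil(septets / 153);
    return { encoding: "GSM-7", units: septets, parts };
  }
  // UCS-2 fallback: count 16-bit units, so an emoji costs 2.
  const units = text.length;
  const parts = units <= 70 ? 1 : Math.ceil(units / 67);
  return { encoding: "UCS-2", units, parts };
}

console.log(smsParts("a".repeat(160)));               // fits in one GSM-7 part
console.log(smsParts("a".repeat(160) + "\u{1F600}")); // falls back to UCS-2
```

With these rules, 160 ASCII characters are one part, while the same text plus one emoji measures 162 units in UCS-2 and needs three parts.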

Question 5

When should I normalize before measuring?

Accepted Answer

Always, if you care about consistent comparison. Two visually identical strings can be encoded differently (composed vs decomposed accents, that is, different Unicode normalization forms) and have different byte lengths. Call str.normalize("NFC") in JavaScript before measuring for the most stable results. NFD is the decomposed form (longer); NFKC is the compatibility-composed form (it collapses ligatures and fullwidth forms). Choose one form and apply it everywhere in your pipeline.
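A small sketch of "choose once and apply everywhere", assuming NFC as the chosen form. The helper names are hypothetical; only String.prototype.normalize is a built-in.

```javascript
// Hypothetical pipeline constant: pick one form and use it at every boundary.
const FORM = "NFC";

function canonical(text) {
  return text.normalize(FORM);
}

// Compare two strings after bringing both to the same form.
function sameText(a, b) {
  return canonical(a) === canonical(b);
}

const composed = "caf\u00E9";    // é as a single code point
const decomposed = "cafe\u0301"; // e followed by a combining acute accent

console.log(composed === decomposed);        // raw comparison
console.log(sameText(composed, decomposed)); // comparison after normalizing

// NFKC additionally folds compatibility characters such as the "fi" ligature.
console.log("\uFB01".normalize("NFKC"));
```

The raw comparison fails even though both strings render as "café"; the normalized comparison succeeds. NFC leaves the ligature U+FB01 alone, so switch FORM to "NFKC" only if you want that folding everywhere.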

Question 6

Why do some combining emoji show such high UTF-8 byte counts?

Accepted Answer

Because they are stitched together from multiple code points joined by zero-width joiners (ZWJ). The family-of-four emoji is technically four separate person emoji, each 4 bytes, joined by three ZWJ characters at 3 bytes each: 25 bytes for what renders as one image. Adding skin-tone modifiers or gender variants pushes it higher. If you want to accept such emoji, set a UTF-8 byte budget generous enough for them; if your use case does not need emoji, reject them during validation.
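The arithmetic can be checked with a short sketch that breaks the sequence into code points and encodes each one separately. The function name is hypothetical; the emoji is written with escapes so the code points are visible.

```javascript
const encoder = new TextEncoder();

// Man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// List every code point in the string with its UTF-8 byte cost.
function byteBreakdown(text) {
  return [...text].map((ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    bytes: encoder.encode(ch).length,
  }));
}

const parts = byteBreakdown(family);
const totalBytes = parts.reduce((sum, p) => sum + p.bytes, 0);

console.log(parts);
console.log(totalBytes); // 4 * 4 + 3 * 3
```

The breakdown lists seven code points: four at 4 bytes and three U+200D joiners at 3 bytes, for a total of 25.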

Question 7

Does .length ever equal code-point count?

Accepted Answer

Only when your string contains no characters above U+FFFF. For pure ASCII, Cyrillic, Arabic, Hebrew, Greek, common CJK, and anything else in the Basic Multilingual Plane, .length equals code-point count. Supplementary-plane characters (most emoji, rare CJK extensions, ancient scripts like Gothic or Phoenician, and mathematical alphanumeric symbols) are encoded as surrogate pairs in UTF-16 and count as 2 in .length.
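A quick way to see the difference is to compare .length with the number of elements the spread operator produces, since spreading a string iterates by code point. The helper names below are hypothetical.

```javascript
// Spread iterates by code point, so this counts code points.
function codePointCount(text) {
  return [...text].length;
}

// True when no character needed a surrogate pair.
function lengthMatchesCodePoints(text) {
  return text.length === codePointCount(text);
}

const samples = [
  "hello",              // ASCII
  "\u041F\u0440\u0438", // Cyrillic, inside the BMP
  "\u{10330}",          // Gothic letter, supplementary plane
  "\u{1D400}",          // mathematical bold A, supplementary plane
];

for (const s of samples) {
  console.log(s.length, codePointCount(s), lengthMatchesCodePoints(s));
}
```

The two BMP samples report equal counts; each supplementary-plane sample reports a .length of 2 for a single code point.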

Question 8

How is this different from wc -c and wc -m?

Accepted Answer

wc -c counts bytes, which matches this tool's UTF-8 bytes metric when the file is UTF-8 encoded. wc -m with a UTF-8 locale counts characters as Unicode code points, matching the code-points metric here. Neither option reports UTF-16 code units or graphemes. For interactive debugging across the full matrix, this web tool is faster than switching locales and re-running wc.

Question 9

What is the practical difference between NFC and NFD?

Accepted Answer

NFC (Canonical Composition) prefers pre-composed characters: a single code point U+00E9 for é. NFD (Canonical Decomposition) prefers the base character plus combining marks: U+0065 U+0301. They render identically but NFD is longer in both bytes and code-point count. macOS historically used NFD in filenames (which caused surprising git diff noise when syncing repos across platforms), while most other systems use NFC. When in doubt, normalize to NFC.
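The size difference is easy to measure directly. This sketch normalizes the same character to each form and reports its counts; the function name is hypothetical.

```javascript
const encoder = new TextEncoder();

// Measure one string after normalizing it to the given form.
function measureForm(text, form) {
  const normalized = text.normalize(form);
  return {
    form,
    codePoints: [...normalized].length,
    utf16Units: normalized.length,
    utf8Bytes: encoder.encode(normalized).length,
  };
}

console.log(measureForm("\u00E9", "NFC")); // single code point U+00E9
console.log(measureForm("\u00E9", "NFD")); // U+0065 followed by U+0301
```

For this character, NFC measures 1 code point and 2 UTF-8 bytes, while NFD measures 2 code points and 3 UTF-8 bytes.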

Question 10

How should I use this tool when sizing a DynamoDB attribute?

Accepted Answer

DynamoDB limits items to 400 KB, measured in raw bytes, and attribute names count toward that total. For strings, that means UTF-8 byte count. Use the UTF-8 Bytes metric to measure a representative sample, then budget for the longest value you expect. Partition key values must stay within 2048 bytes and sort key values within 1024 bytes. The DynamoDB docs define every limit in bytes, not characters.
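As a rough sketch of the sizing step, the function below estimates the size of an item whose attributes are all strings by adding the UTF-8 bytes of each attribute name and value. It is an approximation under stated assumptions: other attribute types are sized differently and are not handled, 400 KB is taken as 400 * 1024 bytes, and the function and constant names are hypothetical rather than part of any AWS SDK.

```javascript
const encoder = new TextEncoder();

// Assumption: 400 KB interpreted as 400 * 1024 bytes.
const ITEM_LIMIT_BYTES = 400 * 1024;

// Estimate the size of an item that contains only string attributes:
// UTF-8 bytes of every attribute name plus UTF-8 bytes of every value.
function approxStringItemBytes(item) {
  let total = 0;
  for (const [name, value] of Object.entries(item)) {
    total += encoder.encode(name).length + encoder.encode(String(value)).length;
  }
  return total;
}

function fitsItemLimit(item) {
  return approxStringItemBytes(item) <= ITEM_LIMIT_BYTES;
}

const sample = { pk: "user#1", name: "Jos\u00E9" };
console.log(approxStringItemBytes(sample), fitsItemLimit(sample));
```

For the sample item the estimate is 2 + 6 for the key attribute and 4 + 5 for the name attribute, where the accented letter accounts for the extra byte.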

Question 11

Can I trust Intl.Segmenter for grapheme counting?

Accepted Answer

Yes, in current versions of all major browsers. Intl.Segmenter (ECMA-402) implements UAX #29 extended grapheme clusters using the engine's bundled Unicode data. Chromium 87+, Safari 14.1+, and Node 16+ ship it; Firefox added it in version 125 (April 2024), so older Firefox builds need a fallback. The output matches cursor navigation through complex emoji, which is the gold standard for "one user character."
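A minimal sketch of grapheme counting with feature detection. The fallback to code-point counting is an assumption of this example, not something the tool is known to do, and it overcounts combined sequences on engines without Intl.Segmenter. The function name and the default locale are hypothetical choices.

```javascript
// Count user-perceived characters (extended grapheme clusters).
function graphemeCount(text, locale = "en") {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    // Assumed fallback: code points, which overcount ZWJ and combining sequences.
    return [...text].length;
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    if (segment.segment.length > 0) count += 1;
  }
  return count;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(graphemeCount("abc"));
console.log(graphemeCount("e\u0301")); // e plus combining accent
console.log(graphemeCount(family));    // ZWJ emoji sequence
```

On an engine with Intl.Segmenter, the accented letter and the family emoji each count as one grapheme even though they span several code points.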

String Length Calculator

