In summary: Unicode code points ("characters") are 21-bit values, typically handled as 32-bit integers. JavaScript manipulates strings as UTF-16 for historical reasons: early on, 16 bits were deemed enough for all of Unicode (UCS-2). UTF-16 is a variable-length encoding that maps each code point to one or two 16-bit code units. Splitting in the middle of a code point produces one invalid half string and one semantically different half string.
Emojis are often a sequence of Unicode code points producing a single grapheme. Splitting in the middle of a grapheme will produce two valid strings, but with some funky half-baked emoji. So for a text editor it makes sense to split only at grapheme boundaries.
Just noticed this is getting some traffic! It's a little buried in the post, but I made an interactive tool for exploring surrogate pairs as part of this:
- https://george.mand.is/invalid-surrogate-pairs/
I thought it was something that's easier to play with and feel than necessarily just read about.
I had an emoji-cut-in-half problem in Dart. I was a bit surprised, because I thought substring operations worked on characters. It only produced an invalid Unicode symbol, though, so not too bad.
Once I ran into this it became hard to treat strings “normally” in any situation or, alternatively, I’d force hard encoding requirements in the domain. Regardless, handling grapheme clusters properly is hard and easy to get wrong.
I recently ported a program from Python to Rust, and the original author used string regexes. Input and output document encoding mattered, but the characters that needed to be matched were always lower ASCII. The Python program could have used binary regexes, but instead forced an input encoding (UTF-8) and made the user choose an output encoding. When the input comes from an unknown process or legacy data, however, you don't always get the luxury of assuming the encoding. Switching to binary regexes and ignoring encoding altogether simplified the logic, eliminated whole classes of errors, and made the program work in scenarios it couldn't before. Getting rid of the last decoding/encoding code was such a relief, especially when all of the wacky encoding tests I had already written continued to pass.
You're reminding me we also circled an issue at one point where a backend system in Python needed to agree with the client (JavaScript) on the character count of a piece of content. Another place Intl.Segmenter would've helped.
If I'm remembering correctly, we briefly explored a solution where we told Python "This is a UTF-16LE encoded string" so the count would match, but I think we learned/realized the endianness is actually dictated by the client's machine (Going from memory here). Ultimately we just changed the solution so the client was the source of truth about lengths and counts.
These threads are surfacing all kinds of things I forgot about and didn't add in that blog post. Maybe I need to write another, haha.
Writing property tests on functions that work with strings is a good way to find lots of Unicode issues.
Damn, I’ve never really had to deal with Unicode all that much.
Was already bad enough that instead of bytes, we have to worry about code points. Now even that isn’t enough?
It would have been expensive, but all characters should have been fixed size 64bit values.
> It would have been expensive, but all characters should have been fixed size 64bit values
You're making the same mistake that numerous people made before you: thinking that it's as simple as using arrays of large enough numbers. First they thought that two bytes per symbol would be enough, then four. Spoiler alert: it wasn't. And eight won't work either.
> It would have been expensive, but all characters should have been fixed size 64bit values.
It would have been a non-starter. "Suck it up" has never gotten a standard adopted. UTF-8 is about as elegant as it gets, though Java and JS still managed to fuck that up too (they both encode every codepoint outside the BMP with surrogate pairs)
It's good to know about surrogate pairs in Unicode. It was new to me too when I helped track down incomplete Unicode flags in the (excellent) Phanpy Mastodon client.
Author went for Intl.Segmenter too: https://github.com/cheeaun/phanpy/issues/1491
My recollection (that I didn't add to the story): I don't think Intl.Segmenter had great browser support then (2022). Even if it had it still wasn't a quick/obvious fix for our problem with where it was occurring in our stack. But I do remember looking at it then.
Great write-up. Do most modern languages handle invalid surrogates gracefully, or is it still a "good luck" situation depending on the runtime?
Modern string libraries largely use UTF-8 [0], and surrogates, whether or not they're paired, are invalid in UTF-8. So in a modern string library, as built into most modern languages, you won't encounter surrogates except when translating between encodings.
[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.
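Right — and JS's own UTF-8 path behaves exactly this way: TextEncoder can't emit a lone surrogate, so it substitutes U+FFFD (per the WHATWG Encoding spec, as I understand it):

```javascript
// An unpaired high surrogate is legal inside a JS string, but not in UTF-8.
const bytes = new TextEncoder().encode("\uD83E");
console.log([...bytes].map(b => b.toString(16))); // ["ef", "bf", "bd"] — UTF-8 for U+FFFD (�)
```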
The language handled it fine. It will generally just show replacement characters (�) for combos that don't map to anything.
It was really `encodeURIComponent` that didn't handle it gracefully.
If you just type this into the console (surrogate pair for cowboy smiley face emoji), you see it encodes it ("%F0%9F%A4%A0"):
encodeURIComponent("\uD83E\uDD20")
If you give it an invalid surrogate pair, it will throw an actual error:
encodeURIComponent("\uDD20\uD83E")
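If you need to guard against that, here's a minimal well-formedness check you could run before encodeURIComponent (a sketch; isWellFormedUTF16 is a hypothetical helper name, and newer engines have a built-in String.prototype.isWellFormed that does the same job):

```javascript
// A string is well-formed UTF-16 iff every high surrogate (D800–DBFF) is
// immediately followed by a low surrogate (DC00–DFFF), and no low surrogate
// appears on its own.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {        // high surrogate: needs a low partner
      const next = s.charCodeAt(i + 1);      // NaN past the end — comparison fails
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false;
      i++;                                   // skip the paired low surrogate
    } else if (c >= 0xdc00 && c <= 0xdfff) { // unpaired low surrogate
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUTF16("\uD83E\uDD20")); // true — the valid pair, safe to encode
console.log(isWellFormedUTF16("\uDD20\uD83E")); // false — the swapped pair that throws
```

With that, the bad input can be rejected up front instead of surfacing as a URIError deep in the stack.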