Truncating Unicode


Doug Hoyte

The easiest problem in the world

This talk is about truncating strings.
  my $output = substr($unicode_string, 0, $max_length);
What could possibly go wrong with that? This talk is about:

Oh? You thought this talk would be about something useful? 😄😄😄

Truncating before encoding

The problem with truncating before encoding is you have no idea how long the encoded string is in terms of bytes:
length(encode("UTF-8", substr($my_string, 0, 10)))
  ==
 ????

There are other problems with such truncation too, as we'll see.

Truncating after encoding

The problem with truncating after encoding is it can corrupt text by cutting in inappropriate places:
$ perl -MEncode -Mutf8 -E 'my $e = encode("UTF-8", "λ");
                           say substr($e, 0, 1);'
�
If you try to insert such truncated text in a DB, you'll probably get a DB error.

Replacement character

By the way, that funny-looking question mark character that you've seen before is actually your terminal replacing an invalid byte with a special unicode character called the "replacement character" (U+FFFD):
Decoders replace invalid bytes with this character.

In the beginning there was ASCII...

Well actually, in the beginning there was baudot, a 5-bit character encoding invented in 1870*. This code was designed to be used by telegraphs and is still in use by radio amateurs.

Baudot was actually based off another 5-bit code invented by Gauss in 1834 which was in turn based off a 5-bit code invented by Francis Bacon in 1623.
* Not a typo
† Not a typo
‡ Still not a typo

Use somebody else's alphabet

5-bit codes like baudot can only encode 32 symbols... Can the same alphabet encode multiple languages?
Of course switching a whole writing system is another matter...

Funny looking letters

It has been apparent for some amount of time (at least to non-americans 😉) that people may wish to use non-latin characters. For example, russian-speakers adapted morse code to their alphabet and sometimes it worked out:

EnglishMorse CodeRussian
P· − − ·П
Sometimes not so much:
EnglishMorse CodeRussian
V· · · −Ж

Necessity is the mother of invention...

RussianЖиваго
Visual transliteration}|{uBaro
Phonetic transliterationZhivago
Arabic scriptتحكي عربي؟
Arabizi (chat alphabet)ta7ki 3arabi?
TranslationDo you speak Arabic?

Codebrian Explosion

So there are all these languages, we need mappings from bit representations to their symbols.

Code-pages


Have you ever catted a binary file to a unix terminal and had your prompt screwed up?

doug@hcsw:~$ echo -e "\x1B(0"

␍┤±@␤␌┬:·$ 

This is because of escape codes triggering ISO 2022 code-page switching in your terminal (where multiple character-sets can be tunneled over a 7-bit transport)
Bonus question: How do you fix it?

How to repair your terminal

Answer:

You echo the appropriate ISO 2022 reset sequence of course:

echo -e "\x1Bc"

Or, if for some reason you can't remember that, there is the reset command

Fundamental Invention of Unicode

Unicode did the computer-sciency thing and abstracted characters away from their bit pattern representations.
Although you could argue that the 19th century Chinese telegraph code was the first use of "code-points" in the unicode sense

16 bits is enough for anybody

In the original unicode design document, Unicode 88, the designers never anticipated needing more than 2 bytes to store a code-point:

Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.
Needless to say, that didn't fly.

Encodings

Any unicode code-point can be represented in various ways:
UCS-2UTF-16UTF-8
%00 2500 2525
ɚ02 5A02 5AC9 9A
20 3020 30E2 80 B0
📀D8 3D DC C0F0 9F 93 80

Stateful encoding

An idea for encoding unicode that never really took off (except arguably ISO 2022) is stateful encoding.

Standard Compression Scheme for Unicode: This mode maps the byte values 128–255 to a window of unicode code-points and has special shifter commands to change where this window points.

Binary Ordered Compression for Unicode: This mode encodes the full value of the first unicode code-point, and then records the code-point differences. For example, the string ABCDCBA becomes the following sequence of code-points:
  65  1  1  1  -1  -1  -1
This is similar to image gradient filters: Runs of similar numbers compress well.

Mojibake

When the decoding doesn't match the encoding you get annoyingly screwed up text like:
$ perl -MEncode -E 'use encoding "utf-8";
                    say encode("utf8", "文字化け")'
æå­åã
The Japanese (who have suffered much more encoding anguish than we have) have a word for corruption due to encoding: mojibake.

Also why the heck are things like ☔ (U+2614) in Unicode? You can thank the Japanese for that too: They are in there to round-trip with Japanese character sets.

UTF-16 curse

Not all languages were as forward thinking as perl, and most just hacked in UCS-2 (later UTF-16):

Default python build exposes UTF-16 surrogate pairs:
>>> print sys.maxunicode
65535
>>> print len(u"𝔸")
2
Python built with --enable-unicode=ucs4
>>> print sys.maxunicode
1114111
>>> print len(u"𝔸")
1

Cursed truncation

Because of the UTF-16 curse, Python can cut code-points in half:
>>> print u"𝔸"[0:1]
�
Python 3.3 has taken steps to cure the UTF-16 curse by using UTF-8 internally (kinda), but Java/Windows/Javascript/etc are beyond redemption.

Ewwwwww gross:
$ nodejs -e 'console.log("\ud83d\udca9")'
💩
Much better:
$ perl -CAS -E 'say "\x{1f4a9}"'
💩

UTF-8

UCS-2 can't even represent all of unicode, and UTF-16 has numerous issues. One of which is that it can contain ␀ bytes and therefore don't play well with C APIs that use ␀ bytes as string terminators, for example in file/directory names.

UTF-8 was developed as a solution to this problem. In fact, it was originally called FSS-UTF (File System Safe UCS Transformation Format).

Over-long UTF-8

For security it's important to validate that UTF-8 is valid. A naive UTF-8 parser could allow multiple representations for a single character. All but the shortest representations are called "over-long" and are illegal UTF-8.

Varicode

You can jump to any byte in UTF-8 encoded text and scan forward to find the next byte that starts with 0 or 11 bits: This is the start of a new character.

In amateur radio there is an encoding called Varicode which is self-synchronising at the bit-level:

Fibonacci codes

How do you generate codes that don't contain a 00 (or 11) bit sequence? With the Fibonacci sequence of course (did you think it was just for recursion homework? 😁)

Suppose we want the code for 17. Pick the largest fibbonaci number ≤ 17 and subtract this number. Repeat until remainder is 0:
123581321
×101001
Σ1313= 17
This sequence is 100101. Fibonacci sequences will never contain the bit-string 11. If you want sequences that can contain 11 but never contain 111 there is the less-commonly known Tribonacci sequence (see the N-bonacci family).

Combining Characters

So if we don't do silly things like operate on UTF-16 surrogates, and we understand synchronising boundaries like in UTF-8 then at least we can avoid truncating in the middle of characters right? Not so fast…

Unicode has a concept of "combining" characters (actually combining code-points). They are modifier code-points that change the previous code-point in some way.

For example, the U+0301 code point adds an acute accent:
  "ne\x{301}e"' : née
Truncating after the second code-point would drop the accent which is a serious corruption of text in some languages.

Combining Characters: Thai

In English we think of accents as decorative fluff, but they are fundamental to the encoding of some languages such as Thai:

Combining Characters: Misc

Unicode 8 will introduce combining characters to change the skin colours of emoji:

Normalisation Forms

To complicate things more, often there are multiple representations for the same character (often because a combined character was present in an encoding Unicode provides round-tripping for).

Combining Grapheme Joiner

By the way, if you want to prevent a combining mark from composing even under NFC, there's a character for that:

use Unicode::Normalize;

say Dumper(NFC("e\x{34f}\x{301}"));

## $VAR1 = "e\x{34f}\x{301}";

How many bytes can a char take up?

  • Technically no limit to number of combining characters
  • UAX-15 recommends a limit of 30
  • Largest "legitimate" use is Tibetan character with 8 combining marks

Extended Grapheme Clusters

So what is a "character" in unicode? Byte? No. Unicode works at a higher level than bytes. Code-point? No. Code-points can combine together to make single letters.

TR29 EGC Regexp

Here is the regexp to validate an EGC, straight out of the TR29 report:
Term Regex Notes
combining character sequence base? ( Mark | ZWJ | ZWNJ )+ A single base character is not a combining character sequence. However, a single combining mark is a (degenerate) combining character sequence.
extended combining character sequence extended_base? ( Mark | ZWJ | ZWNJ )+ extended_base includes Hangul Syllables
legacy grapheme cluster ( CRLF
| ( RI-sequence
| Hangul-Syllable
| !Control )
  Grapheme_Extend*
| . )
A single base character is a grapheme cluster. Degenerate cases include any isolated non-base characters, and non-base characters like controls.
extended grapheme cluster ( CRLF
| Prepend*
  ( RI-sequence
| Hangul-Syllable
| !Control )
  ( Grapheme_Extend
| SpacingMark )*
| . )
Extended grapheme clusters add prepending and spacing marks

EGC Special-Cases

As you can see from this regexp, there are a couple special cases in the segmentation standard:

Unicode::Truncate

My module Unicode::Truncate:

Inline::C

A bit of a detour:

use Inline C;

print add(4, 9);

__END__
__C__
int add(int x, int y) {
  return x + y;
}

Inline::Filters::Ragel

use Inline C, filters => [ [ Ragel => '-G2' ] ];

__END__
__C__

// C + ragel code goes here

Let's look at the code

Code walk-through:
lib/Unicode/Truncate.pm

Unicode test suites

The Unicode Consortium maintains many large test files for implementations to check their conformance to Unicode standards ("unidata").

Unicode::Truncate ships the GraphemeBreakTest.txt file from Unicode 7.0 which is full of hundreds of lines like this:

÷ 0020 × 0308 ÷ 0020 ÷
The above indicates that you can (÷) break before the first and third code-points, but cannot (×) before the second (since it's a combining character).

Questions/Comments?

𝒻𝒾𝓃