Monday 7 December 2015

UTF-8 for Cuis, Pharo and Squeak

Unicode UTF-8 for Smalltalks 

The aim - Sortable, equivalency-testable Unicode strings for Smalltalks. 


Phase 1,
Sortable, equivalency-testable UTF-8 strings which encode ONLY ASCII and ISO 8859-1 ("ISO Latin-1")

Step 1 - Thinking



0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
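    b) UTF-8 encodes the ASCII characters (00 hex to 7F hex) in 1 byte each.  It encodes every other character, including the ISO-8859-1 characters A0 hex to FF hex, as a sequence of 2 to 4 bytes.  A letter with a diacritic can also be written as a base letter followed by a combining mark, which takes more bytes again.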

1) Smalltalk has long had multiple String classes.

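2) Any UTF-16 code unit of the form 00nn hex represents the Unicode codepoint 00nn hex.
    For nn from 00 to 7F hex, UTF-8 encodes it as the single byte nn hex.  For nn from 80 to FF hex, UTF-8 uses two bytes (C2 hex or C3 hex, followed by one continuation byte).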

3) All valid ISO-8859-1 characters have a character code between 20 hex and 7E hex, or between A0 hex and FF hex.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1

4) All valid ASCII characters have a character code between 00 hex and 7F hex.
https://en.wikipedia.org/wiki/ASCII

5) a) All character codes which are defined within ISO-8859-1 and also defined within ASCII (i.e. character codes 20 hex to 7E hex) are defined identically in both.

b) All printable ASCII characters are defined identically in both ASCII and ISO-8859-1

6) All character codes defined in ASCII  (00 hex to 7F hex) are defined identically in Unicode UTF-8.

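7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex - FF hex ) have the same numeric value as the corresponding Unicode codepoint.  In UTF-8, the codes 20 hex - 7E hex are the same single byte; the codes A0 hex - FF hex become two bytes.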

8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
         all ASCII maps 1:1 to Unicode UTF-8
         all ISO-8859-1 maps 1:1 to Unicode codepoints (and, above 7F hex, to two-byte UTF-8 sequences)

9) All ByteString elements which are either a valid ISO-8859-1 character or a valid ASCII character are *also* a valid Unicode codepoint with the same value.  Only the ASCII ones are unchanged as UTF-8 bytes.

10) ISO-8859-1 characters representing a letter with a diacritic, or a two-character ligature, have no ASCII equivalent.  Unicode calls the single codepoints for letter-plus-diacritic glyphs "precomposed characters"; this note calls them "compatibility codepoints".

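11) Each character which has a compatibility codepoint can also be written as a short sequence of codepoints: the base letter followed by one or more combining marks, which together form the same glyph.  Unicode treats the two forms as canonically equivalent (normalization form NFC composes, NFD decomposes).  The decomposed form is the one preferred here for sorting and comparison.  The ISO-8859-1 ligatures (Æ, æ, ß) have no such decomposition.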

12) Some concrete examples:

A - aka Upper Case A
In ASCII, in ISO 8859-1
ASCII A - 41 hex
ISO-8859-1 A - 41 hex
UTF-8 A - 41 hex

BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
In ASCII, not in ISO 8859-1
ASCII : BEL  - 07 hex
ISO-8859-1 : 07 hex is not a valid character code
UTF-8 : BEL - 07 hex

£ (GBP currency symbol)
In ISO-8859-1, not in ASCII
ASCII : A3 hex is not a valid ASCII code
UTF-8: £ - codepoint 00A3 hex, encoded as the two bytes C2 A3 hex
ISO-8859-1: £ - A3 hex

Upper Case C cedilla
In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint *and* a composed set of codepoints
ASCII : C7 hex is not a valid ASCII character code
ISO-8859-1 : Upper Case C cedilla - C7 hex
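UTF-8 : Upper Case C cedilla (compatibility codepoint) - codepoint 00C7 hex, encoded as the two bytes C3 87 hex
Unicode preferred Upper Case C cedilla  (composed set of codepoints)
   Upper case C 0043 hex (Upper case C)
       followed by
   combining cedilla 0327 hex (the spacing cedilla, 00B8 hex, is a separate standalone character)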
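
The examples above can be checked mechanically.  The sketch below is in Python rather than Smalltalk, only because its standard library exposes the Unicode tables directly.  The helper name describe is an assumption made for illustration.  For each example character it reports the Unicode codepoint, the UTF-8 bytes, and the canonical decomposition (NFD).  Note that the codepoint and the UTF-8 bytes coincide only for ASCII.

```python
import unicodedata

def describe(ch):
    """Return (codepoint, UTF-8 bytes, NFD codepoints) for one character."""
    codepoint = ord(ch)
    utf8_bytes = ch.encode("utf-8")
    decomposed = [ord(c) for c in unicodedata.normalize("NFD", ch)]
    return codepoint, utf8_bytes, decomposed

# A, BEL, pound sign, upper case C cedilla
for ch in ["A", "\x07", "\u00a3", "\u00c7"]:
    codepoint, utf8_bytes, decomposed = describe(ch)
    print(
        "%04X hex" % codepoint,
        "| UTF-8:", " ".join("%02X" % b for b in utf8_bytes),
        "| NFD:", " ".join("%04X" % c for c in decomposed),
    )
```

The last line printed shows C cedilla as codepoint 00C7 hex, UTF-8 bytes C3 87, and decomposition 0043 0327.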

13) For any valid ASCII string *and* for any valid ISO-8859-1 string, aByteString is completely adequate for editing and display.

14) When sorting any valid ASCII string *or* any valid ISO-8859-1 string, upper and lower case versions of the same character will be treated differently.

15) When sorting any valid ISO-8859-1 string containing letter+diacritic combination glyphs or ligature combination glyphs, the glyphs in combination will be treated differently to a "plain" glyph of the character
i.e. "C" and "C cedilla" will be treated very differently.  "ß" and "ss" will be treated very differently.

16) Different nations have different rules about where characters with diacritics and ligature pairs should be placed in alphabetical order.

17) Some nations even have multiple standards - e.g.  surnames beginning either "M superscript-c" or "M superscript-a superscript-c" are treated as beginning equivalently in UK phone directories, but not in other situations.
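
Points 14 and 15 can be seen in a small experiment.  The following is a minimal Python sketch, not a proposal for the Smalltalk implementation, and the name base_letters_key is an assumption.  It compares a plain sort by codepoint with a sort whose key decomposes each string, drops the combining marks and folds case.  It is not a national collation rule of the kind described in points 16 and 17.

```python
import unicodedata

def base_letters_key(text):
    """Sort key: decompose (NFD), drop combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

words = ["cord", "Zebra", "\u00c7a", "apple", "Cab"]

# Plain codepoint order: upper case first, then lower case, then C cedilla.
by_codepoint = sorted(words)

# Key-based order: case and diacritics no longer separate the words.
by_base_letters = sorted(words, key=base_letters_key)

print(by_codepoint)
print(by_base_letters)
```

With this key "Ça" sorts next to "Cab", and case folding also turns "ß" into "ss".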


Some practical upshots
==================

1) Cuis and its ISO-8859-1 encoding gives *exactly* the same codepoint values as Unicode, for any single character it considers valid, or any ByteString it has made up of characters it considers valid.  The bytes are identical to UTF-8 only for the ASCII characters.

2) Any ByteString holds valid Unicode codepoints in any of Squeak, Pharo, Cuis or any other Smalltalk with a single byte ByteString following ASCII or ISO-8859-1.  It is byte-for-byte valid UTF-8 when it contains only ASCII.

3) Any Smalltalk (or derivative language) using ByteString can immediately consider its ByteString as a sequence of valid Unicode codepoints, as long as it also considers the ByteString as valid ASCII and/or ISO-8859-1.

4) All of those can be successfully exported to any system using UTF-8 (e.g. HTML), once the characters above 7F hex are written out in their two-byte UTF-8 form.

5) To successfully *accept* all UTF-8 we must be able to do either:
a) accept UTF-8 strings with composed characters 
b) convert UTF-8 strings with composed characters into UTF-8 strings that use *only* compatibility codepoints.
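
Option b) corresponds to what Unicode calls normalization to NFC: a base letter followed by combining marks is replaced by a single codepoint wherever one exists.  A minimal Python sketch of that step follows.  The function name utf8_to_latin1_bytes is an assumption, and Python's latin-1 codec accepts every byte value from 00 hex to FF hex, so it does not reject control codes.

```python
import unicodedata

def utf8_to_latin1_bytes(utf8_bytes):
    """Decode UTF-8, compose with NFC, and re-encode one byte per character.

    Raises UnicodeDecodeError if the input is not valid UTF-8, and
    UnicodeEncodeError if a character has no codepoint at or below 00FF hex.
    """
    text = utf8_bytes.decode("utf-8")
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("latin-1")

# "C" + combining cedilla + "a", as it might arrive from outside.
incoming = "C\u0327a".encode("utf-8")
print(utf8_to_latin1_bytes(incoming))
```

Input that is already composed passes through unchanged apart from the re-encoding.  Input outside ISO-8859-1, such as the euro sign, is rejected rather than silently altered.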


Having written all this, I have realised a picture is worth a thousand words, and I've now drawn this as a Venn diagram.
It can be found at both:
http://smalltalk.uk.to/unicode-utf8.html

and at my Smalltalk in Small Steps blog at:


I therefore propose, as a first step, these classes:

a Utf8CompatibilityString class.

   asByteString  - ensures only compatibility codepoints are used.  Ensures it does not encode characters above 00FF hex.

   asIso8859String - ensures only compatibility codepoints are used, and that the characters are each valid ISO 8859-1

   asAsciiString - ensures only characters 00hex - 7F hex are used.

   asUtf8ComposedIso8859String - ensures all compatibility codepoints are expanded into small OrderedCollections of codepoints

a Utf8ComposedIso8859String class - will provide sortable and comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
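
To make the intended behaviour of these conversions concrete, here is a hedged sketch of three of them as plain Python functions.  The function names are assumptions that only mirror the proposed selectors, and nothing here is the Smalltalk implementation.  Nested lists of codepoints stand in for the small OrderedCollections.

```python
import unicodedata

def as_ascii_string(text):
    """Return text unchanged if every character is in 00 hex - 7F hex."""
    text.encode("ascii")  # raises UnicodeEncodeError on anything above 7F hex
    return text

def as_iso8859_string(text):
    """Compose the text (NFC), then return it as ISO 8859-1 bytes."""
    composed = unicodedata.normalize("NFC", text)
    data = composed.encode("latin-1")  # raises UnicodeEncodeError above 00FF hex
    for byte in data:
        # Accept only the ranges given in point 3 of the thinking notes.
        if not (0x20 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF):
            raise ValueError("not a valid ISO 8859-1 character: %02X hex" % byte)
    return data

def as_composed_codepoints(text):
    """Expand each character into a small list of codepoints (NFD)."""
    composed = unicodedata.normalize("NFC", text)
    return [[ord(c) for c in unicodedata.normalize("NFD", ch)] for ch in composed]

print(as_iso8859_string("C\u0327a"))
print(as_composed_codepoints("\u00c7a"))
```

Normalizing to NFC first means each inner list describes exactly one displayed character, whichever form the input arrived in.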


Then a Utf8SortableCollection class - a collection of Utf8ComposedIso8859Strings words and phrases.

Custom sortBlocks will define the applicable sort order.  

We can create a Dictionary of named, prefabricated sortBlocks.

This will work for all UTF8 composed strings of ISO-8859-1 and ASCII strings.
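
The Dictionary of named sortBlocks could look roughly like the following, sketched in Python with key functions standing in for sortBlocks.  All the names are assumptions.  The "uk phone directory" rule is a deliberately simplified illustration of the Mc / Mac convention mentioned earlier, not the actual directory rules.

```python
def uk_phone_directory_key(name):
    """Fold case and file a leading "Mc" as if it were spelled "Mac"."""
    folded = name.casefold()
    if folded.startswith("mc"):
        folded = "mac" + folded[2:]
    return folded

# Named, prefabricated sort keys (the analogue of named sortBlocks).
SORT_KEYS = {
    "codepoint": lambda s: s,
    "ignore case": str.casefold,
    "uk phone directory": uk_phone_directory_key,
}

def sort_with(name, strings):
    """Sort strings using the prefabricated key registered under name."""
    return sorted(strings, key=SORT_KEYS[name])

surnames = ["MacDonald", "Mabey", "McAdam", "Madden"]
print(sort_with("codepoint", surnames))
print(sort_with("uk phone directory", surnames))
```

Under the "codepoint" key McAdam sorts last; under the "uk phone directory" key it files between Mabey and MacDonald.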

If anyone has better names for the classes, please let me know.

If anyone else wants to help 
    - build these, 
    - create SUnit tests for these
    - write documentation for these
Please let me know.  euan mee at a well known email system run by Google.com - gmail