Monday 7 December 2015

UTF-8 for Cuis, Pharo and Squeak

Unicode UTF-8 for Smalltalks 

The aim - Sortable, equivalency-testable Unicode strings for Smalltalks. 


Phase 1: Sortable, equivalency-testable UTF-8 strings which encode ONLY ASCII and ISO 8859-1 ("ISO Latin-1")

Step 1 - Thinking



0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
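    b) UTF-8 encodes the ASCII characters (00 hex to 7F hex) in 1 byte each, identical to their ASCII bytes.  Every other character, including the ISO-8859-1 characters from A0 hex to FF hex, is encoded as a sequence of 2 to 4 bytes.  Some accented characters can additionally be written as a base letter followed by a combining mark, which takes more bytes still.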

1) Smalltalk has long had multiple String classes.

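2) Any Unicode codepoint of 00nn hex (as it would appear in UTF-16)
    is encoded in UTF-8 as the single byte nn hex only when nn is 7F hex or below. From 0080 hex to 00FF hex it is encoded as two bytes: C2 hex or C3 hex, then a continuation byte.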

3) All valid ISO-8859-1 characters have a character code between 20 hex and 7E hex, or between A0 hex and FF hex.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1

4) All valid ASCII characters have a character code between 00 hex and 7F hex.
https://en.wikipedia.org/wiki/ASCII

5) a) All character codes which are defined within ISO-8859-1 and also defined within ASCII (i.e. character codes 20 hex to 7E hex) are defined identically in both.

b) All printable ASCII characters are defined identically in both ASCII and ISO-8859-1

6) All character codes defined in ASCII  (00 hex to 7F hex) are defined identically in Unicode UTF-8.

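7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex - FF hex ) have the same numeric value as their Unicode codepoints.  In UTF-8, 20 hex - 7E hex keep the same single byte, while A0 hex - FF hex become two bytes each.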

8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
         all ASCII maps 1:1 to Unicode, and byte-for-byte to UTF-8
         all ISO-8859-1 maps 1:1 to Unicode codepoints, but A0 hex - FF hex take two bytes each in UTF-8

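9) All ByteString elements which are either a valid ISO-8859-1 character or a valid ASCII character are *also* a valid Unicode codepoint.  Only the elements below 80 hex are valid UTF-8 as they stand; the rest must be re-encoded as two bytes.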
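
As a rough check of points 6 to 9, here is a short Python sketch (Python standing in for Smalltalk; the helper names are made up for illustration). It compares the bytes each encoding uses for the same character, and tests whether raw single-byte data decodes as UTF-8. Note that Python's "latin-1" codec also accepts the control codes, which is why BEL encodes without complaint.

```python
# Compare how ISO-8859-1 and UTF-8 store the same characters.
samples = ["A", "\x07", "\u00a3", "\u00c7"]  # A, BEL, pound sign, C cedilla


def describe(ch):
    """Return (codepoint, Latin-1 bytes, UTF-8 bytes) for one character."""
    return (ord(ch), ch.encode("latin-1"), ch.encode("utf-8"))


def is_valid_utf8(data):
    """True if the byte string can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for ch in samples:
    code_point, latin1, utf8 = describe(ch)
    print(f"U+{code_point:04X}  latin-1={latin1.hex()}  utf-8={utf8.hex()}")

# A lone byte from the upper half of ISO-8859-1 is not valid UTF-8.
print(is_valid_utf8(b"ABC"), is_valid_utf8(b"\xa3"))
```

The codepoint and the Latin-1 byte always agree; the UTF-8 bytes agree with them only below 80 hex.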

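10) ISO-8859-1 characters representing a character with a diacritic, or a two-character ligature, have no ASCII equivalent.  In Unicode, those codepoints which represent a letter and its diacritic as a single compound glyph are called "precomposed characters".  (This note calls them "compatibility codepoints", which is where the class names proposed later come from.)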

11) Unicode also allows each of those characters to be represented as a short sequence of codepoints: the base character followed by one or more combining marks.  The two representations are canonically equivalent.  Normalisation form NFD produces the expanded sequence, and NFC the single codepoint.

12) Some concrete examples:

A - aka Upper Case A
In ASCII, in ISO 8859-1
ASCII A - 41 hex
ISO-8859-1 A - 41 hex
UTF-8 A - 41 hex

BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
In ASCII, not in ISO 8859-1
ASCII : BEL  - 07 hex
ISO-8859-1 : 07 hex is not a valid character code
UTF-8 : BEL - 07 hex

£ (GBP currency symbol)
In ISO-8859-1, not in ASCII
ASCII : A3 hex is not a valid ASCII code
UTF-8: £ - codepoint 00A3 hex, encoded as the two bytes C2 A3 hex
ISO-8859-1: £ - A3 hex

Upper Case C cedilla
In ISO-8859-1, not in ASCII, in Unicode as a single precomposed codepoint *and* as a sequence of codepoints
ASCII : C7 hex is not a valid ASCII character code
ISO-8859-1 : Upper Case C cedilla - C7 hex
UTF-8 : Upper Case C cedilla (single precomposed codepoint 00C7 hex) - bytes C3 87 hex
Unicode decomposed Upper Case C cedilla  (sequence of codepoints)
   Upper case C 0043 hex (Upper case C)
       followed by
   combining cedilla 0327 hex (00B8 hex is the free-standing cedilla, which does not combine)
   encoded in UTF-8 as the bytes 43 CC A7 hex
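
The C cedilla example can be checked with Python's standard unicodedata module. This is only a sketch of the two representations and how normalisation moves between them; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00c7"   # LATIN CAPITAL LETTER C WITH CEDILLA, one codepoint
decomposed = "C\u0327"   # C followed by COMBINING CEDILLA, two codepoints

# NFD expands the single codepoint; NFC folds the sequence back together.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)
print(nfd == decomposed, nfc == precomposed)

# The UTF-8 bytes differ even though the text is equivalent.
print(precomposed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())

# The free-standing cedilla (00B8 hex) does not combine with a preceding C.
spacing = unicodedata.normalize("NFC", "C\u00b8")
print(spacing == precomposed)
```

A plain byte-for-byte comparison would call the two forms different, which is why equivalence testing needs one agreed form.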

13) For any valid ASCII string *and* for any valid ISO-8859-1 string, aByteString is completely adequate for editing and display.

14) When sorting any valid ASCII string *or* any valid ISO-8859-1 string, upper and lower case versions of the same character will be treated differently.

15) When sorting any valid ISO-8859-1 string containing letter+diacritic combination glyphs or ligature combination glyphs, the glyphs in combination will be treated differently to a "plain" glyph of the character
i.e. "C" and "C cedilla" will be treated very differently.  "ß" and "ss" will be treated very differently.

16) Different nations have different rules about where diacritic-ed characters and ligature pairs should be placed when in alphabetical order.

17) Some nations even have multiple standards - e.g.  surnames beginning either "M superscript-c" or "M superscript-a superscript-c" are treated as beginning equivalently in UK phone directories, but not in other situations.
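
Points 14 to 17 can be illustrated with a small Python sketch. The names list and the directory_key rule are invented for illustration, and the rule is a toy, not a real collation standard: it only shows how a custom sort key changes the order that plain character-code comparison gives.

```python
names = ["Macbeth", "McAdam", "MacDonald", "Mabbott", "McBride"]

# Default ordering compares character codes, so upper case sorts before
# lower case ("MacDonald" before "Macbeth") and every "Mac..." sorts
# before every "Mc..." ('a' is 61 hex, 'c' is 63 hex).
by_code_point = sorted(names)


def directory_key(name):
    """Hypothetical phone-directory rule: file "Mc" as if spelled "Mac"."""
    if name.startswith("Mc"):
        name = "Mac" + name[2:]
    return name.casefold()


by_directory_rule = sorted(names, key=directory_key)

print(by_code_point)
print(by_directory_rule)
```

The same data gives two different "alphabetical" orders, so the sort rule has to be chosen by the caller rather than built into the string class.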


Some practical upshots
==================

1) Cuis and its ISO-8859-1 encoding uses *exactly* the same character codes as the Unicode codepoints 0000 hex - 00FF hex, for any single character it considers valid.  The bytes are the same as UTF-8 only for characters below 80 hex.

2) Any ByteString made up solely of ASCII characters is already valid UTF-8 in any of Squeak, Pharo, Cuis or any other Smalltalk with a single byte ByteString following ASCII or ISO-8859-1.

3) Any Smalltalk (or derivative language) using ByteString can convert its ByteString to valid UTF-8 one character at a time, as long as it also considers the ByteString as valid ASCII and/or ISO-8859-1: bytes below 80 hex are copied unchanged, and each byte from 80 hex upward becomes two bytes.

4) Once converted, all of those can be successfully exported to any system using UTF-8 (e.g. HTML).

5) To successfully *accept* all UTF-8 we must be able to do either:
a) accept UTF-8 strings with composed characters (a base character followed by combining marks)
b) convert UTF-8 strings with composed characters into UTF-8 strings that use *only* compatibility codepoints (one codepoint per accented character).
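
Option b) might look something like the following Python sketch. The function name is hypothetical; it leans on the standard library's NFC normalisation to fold base-plus-combining-mark sequences into single codepoints, then insists that the result fits in one byte per character.

```python
import unicodedata


def utf8_to_latin1(data):
    """Decode UTF-8, fold combining sequences, return ISO-8859-1 bytes.

    Raises UnicodeEncodeError if any character is still above 00FF hex
    after folding, since a single-byte string cannot hold it.
    """
    text = data.decode("utf-8")
    folded = unicodedata.normalize("NFC", text)
    return folded.encode("latin-1")


# Both spellings of "Ça" arrive as different bytes and leave as the same ones.
from_sequence = utf8_to_latin1(b"C\xcc\xa7a")
from_single = utf8_to_latin1(b"\xc3\x87a")
print(from_sequence == from_single)
```

Input that cannot be represented (the euro sign, say) is rejected rather than silently mangled; whether to reject or substitute is a design choice the sketch leaves open.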


Having written all this, I have realised a picture is worth a thousand words, and I've now drawn this as a Venn diagram.
It can be found at:
http://smalltalk.uk.to/unicode-utf8.html

and on my Smalltalk in Small Steps blog.


I therefore as a first step, propose these classes

a Utf8CompatibilityString class.

   asByteString  - ensure only compatibility codepoints are used.  Ensure it does not encode characters above 00FF hex.

   asIso8859String - ensures only compatibility codepoints are used, and that the characters are each valid ISO 8859-1

   asAsciiString - ensures only characters 00hex - 7F hex are used.

   asUtf8ComposedIso8859String - ensures all compatibility codepoints are expanded into small OrderedCollections of codepoints

a Utf8ComposedIso8859String class - will provide sortable and comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
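
To make the idea of "equivalency-testable" concrete, here is a hedged Python analogue of the proposed Utf8ComposedIso8859String. The class name, its methods and the choice of NFD as the stored form are my assumptions for the sketch, not a finished design.

```python
import unicodedata


class ExpandedLatin1String:
    """Hypothetical analogue of the proposed Utf8ComposedIso8859String."""

    def __init__(self, text):
        # Reject text that would not fit ASCII / ISO-8859-1 once each
        # accented character is folded to a single codepoint.
        unicodedata.normalize("NFC", text).encode("latin-1")
        # Store the fully expanded form so equivalent text compares equal.
        self.text = unicodedata.normalize("NFD", text)

    def __eq__(self, other):
        if not isinstance(other, ExpandedLatin1String):
            return NotImplemented
        return self.text == other.text

    def __hash__(self):
        return hash(self.text)

    def as_utf8(self):
        """UTF-8 bytes of the expanded form."""
        return self.text.encode("utf-8")


single = ExpandedLatin1String("\u00c7a")     # one codepoint for C cedilla
sequence = ExpandedLatin1String("C\u0327a")  # C plus combining cedilla
print(single == sequence)
```

Because every instance holds the same expanded form, equality is a plain comparison; ordering is deliberately left out, since that is the job of the sort blocks.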


Then a Utf8SortableCollection class - a collection of Utf8ComposedIso8859Strings words and phrases.

Custom sortBlocks will define the applicable sort order.  

We can create a Dictionary of named, prefabricated sortBlocks.

This will work for all UTF8 composed strings of ISO-8859-1 and ASCII strings.
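
A Dictionary of named sortBlocks could be sketched in Python as a dict of key functions. The three names and rules below are placeholders of my own, far simpler than any national standard would need.

```python
import unicodedata


def base_letters(word):
    """Expand accented characters, then drop the combining marks."""
    expanded = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in expanded if not unicodedata.combining(ch))


# Named, prefabricated sort rules; each maps a word to its sort key.
SORT_KEYS = {
    "code-point": lambda word: word,
    "ignore-case": lambda word: word.casefold(),
    "ignore-case-and-diacritics": lambda word: base_letters(word).casefold(),
}


def sort_words(words, order_name):
    """Sort words using the rule registered under order_name."""
    return sorted(words, key=SORT_KEYS[order_name])


words = ["\u00e9clair", "Zoo", "apple", "eel"]
for name in SORT_KEYS:
    print(name, sort_words(words, name))
```

Adding a new national or house rule then means adding one entry to the dictionary, without touching the string classes.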

If anyone has better names for the classes, please let me know.

If anyone else wants to help 
    - build these, 
    - create SUnit tests for these
    - write documentation for these
Please let me know.  euan mee at a well known email system run by Google.com - gmail

Monday 30 November 2015

Composing with roles

A role does several things (a bit of hand-waving here):
  • It might provide behaviour (methods)
  • It might require behaviour (acting sort of like an interface)
  • Any conflicts (duplicate methods) must be resolved at compile-time

Sandi Metz's talk, "Nothing is Something" (link), is an enormous help in understanding roles and their use in specialisation by composition.

Why have keyword names for multiparameter message sends?

In Objective-C and Swift, the labels given to arguments in a function call are called keyword names (Swift calls them argument labels). They trace their roots back to the Smalltalk language, where they are keyword names in message sends.
Classes and objects are often re-used from somewhere else, or form part of very large complex systems. They will often not have active maintenance attention for long periods at a time.
Improving the clarity and legibility of the code is very important in these situations, as code often ends up as the only documentation, especially when developers are under deadline pressure, or simply dislike writing comments.  (q.v. Comments are always wrong (link))
A descriptive keyword name allows maintainers to quickly see what the purpose of each argument in a function call is, and which order the arguments appear in, by simply glancing at the function call itself, rather than having to delve deep into the function code. It makes explicit the implied meaning of the parameters.
Rust (link) - described as "a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety" - does not itself have keyword names in function calls, although its struct literals name every field.  High uptime systems require greater code quality. Keyword names allow development and maintenance teams much more opportunity to avoid and to catch errors from sending the wrong parameter, or passing parameters out of order.
Keyword names can be wordy or terse, but Smalltalkers prefer wordy and descriptive to terse and meaningless. They can afford to be, because their IDE will do the bulk of such typing for them.
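
As an illustration outside Smalltalk, Python can force callers to name every argument. The transfer function and its parameter names are invented for this sketch.

```python
def transfer(*, amount, from_account, to_account):
    """Keyword-only parameters: callers must name every argument."""
    return f"{amount} from {from_account} to {to_account}"


# The call site documents itself, and argument order no longer matters.
receipt = transfer(amount=250, from_account="savings", to_account="current")
same = transfer(to_account="current", amount=250, from_account="savings")
print(receipt == same)
```

A positional call such as transfer(250, "savings", "current") is rejected with a TypeError, so swapping the two accounts by accident is much harder.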

Why learn Smalltalk?

This is the answer I gave over on Stack Overflow, to 

Would you start learning Smalltalk? (link) 

1) Yes! 
It's always good to learn a language. 
If you are going to learn a language, make it a powerful, influential language that can be learnt easily and quickly.
Smalltalk remains a pre-eminent language and environment for learning OO concepts.
It is all objects, all the way down. This makes for a really consistent approach to working.
Integers are instances of Class Integer. Strings are a collection of character objects. Classes are themselves objects, each the sole instance of its own metaclass.
Control structures work by sending messages to instances of Class Boolean.
Even anonymous methods (blocks of code, aka blocks) are objects.
Everything is done by sending a message to an object. The syntax can be fitted on a postcard.
The clarity of the concepts and their implementation in Smalltalk mean that you can develop ways of thought which transfer directly into Java, Ruby and C#. I expect it's true for Python, too.
It's so good for making the concepts clear that a major UK University used Smalltalk to train 5,000 people a year in object-oriented computing.
Squeak 5 has just been released. It has gained major performance increases from its new Cog/Spur VM, which features progressive garbage-collection.
Pharo 4 has a lovely clean-looking desktop theme. The next version, Pharo 5, will be released soon. It will move to using the Cog/Spur VM and will have about 5,000 classes in the release; additional packages of classes are readily available from the net via the Configuration Browser tool.
Squeak 5 is performant even on first-gen Raspberry Pis, and is almost 50% faster on the new $5 Raspberry Pi Zero. $99 buys you a Raspberry Pi 2, screen and case - running a mature, fully feature-complete IDE.
Leading edge research is being done on co-ordinated, distributed OO systems in Smalltalk (e.g. Naiad and Spoon).
Some of the world's largest corporate databases are run on Smalltalk - including tracking of 60% of the world's shipping containers, and trading systems in the world's largest bank.
You can use Smalltalk as a sort of super-powered CoffeeScript, writing in Amber Smalltalk and transpiling to JavaScript, running in the browser.
Squeak, Pharo, and Amber are all Free, Open-source, open-licenced languages and environments.
Squeak and Pharo provide write-once, run anywhere facilities for MacOS, Windows and Linux. (Possibly RiscOS, too).
Dolphin Smalltalk is targeted firmly at native Windows look-and-feel, and lets you compile closed .exes of your finished work for distribution to end users. Further development of Dolphin by the vendor has stopped, but it is completely functional, and, like all Smalltalks, designed to be massively extensible. (Did I mention that Pharo now has 5,000 classes, compared to Squeak's 3,000? Pharo is a fork of Squeak 3.9)
There is a How-to guide for installing and starting Squeak, Amber, Pharo, Cuis and Dolphin at: http://beginningtosmalltalk.blogspot.co.uk/2015/11/how-to-get-smalltalk-up-and-running.html
The Seaside web framework runs on Squeak and on Pharo. It's a wonderful mature tool, as is the more traditional AidaWeb framework.
VisualAge, VisualWorks and Gemstone all provide enterprise-grade robust systems. Gemstone provides an infinitely scalable object database with transactions and persistence.

Do you already know Smalltalk? (same link) 

2) Yes - I do already use it.
I learnt it via the Open University, and was immediately productive in Ruby (a copy of the Pickaxe book and the library reference by my side). It helped me enormously with Java, and with Xerox Moo-code.
I have just returned to it to write apps to control, manage and distribute responsive, massively multi-platform mobile apps.
I expect that soon I'll be re-writing my JavaScript mobile apps using Amber, too.

Sunday 29 November 2015

Squeak 5 and Pharo 4 side-by-side on Raspberry Pi (videos)

The videos are all of a Raspberry Pi Model B - i.e. it is the 1st Generation, single core 700MHz device.

For comparison, the newly launched Pi Zero is a single-core 1GHz device, so is substantially faster.

The Raspberry Pi 2 is a 900MHz quad-core device.

The Orange Pi (a Chinese Pi-alike) has options up to 1.6GHz.

Intel Atom CPUs start from ~$120 - that is for the CPU (or the System on Chip?) alone.

Be warned - the videos have been compressed as much as possible and are downloaded, not streamed.

http://smalltalk.uk.to/SmalltalkOnPi.html

Saturday 28 November 2015

Smalltalk.uk.to is now live

A site to show that SMALLTALK is for the UK TOo!  (As a mnemonic, that's very contrived.  I agree)

Currently, it has a proposed interim logo for Cuis, and a video showing Pharo 4 and Squeak 5 running side by side on a Raspberry Pi Model B (the original 700MHz model).