| UTF-8 / Unicode & DBCS |
|
| What is UTF-8? | UTF-8
(8-bit Unicode Transformation Format) provides a method
to store Unicode data in 8-bit bytes. It uses
variable-length character encoding to represent the
various Unicode values. It can represent any
character in the Unicode standard, yet
remains backward compatible with ASCII.
Because it provides direct access to the Unicode character set while keeping 8-bit characters, UTF-8 has been embraced by PxPlus and is steadily becoming the preferred encoding for email, web pages, and other places where characters are stored or streamed. It is also commonly used on web pages to display extended character sets.
Internally, UTF-8 uses one to four bytes (strictly, octets) per character, depending on the Unicode symbol. Only one byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F). Two bytes are needed for Latin letters with diacritics and for combining marks, as well as for characters in the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF). Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode; these are supported by PxPlus but are not commonly used. The following table shows how this encoding is done:
In the above table, the first column represents the Unicode value in hex, ranging from 0 through the potential maximum four byte value. Depending on the Unicode value, the system will use 1 through 6 UTF-8 bytes (values within the range of the Unicode standard, up to U+10FFFF, need at most 4). The first byte of the UTF-8 sequence indicates the number of bytes in the sequence. If the Unicode value is between 0 and 127 ($00$ through $7F$), the UTF-8 byte is the same as the Unicode value. If it is between $80$ and $7FF$, the first byte is in the range $C0$ through $DF$: it consists of $C0$ Or'ed with the top 5 bits of the Unicode value, and is followed by a second byte made up of the remaining 6 bits of the Unicode value Or'ed with $80$. As the Unicode value increases, so does the number of UTF-8 bytes needed to represent the data. |
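The bit layout described above can be sketched in Python (not PxPlus) for the one to four byte cases. The function name `utf8_encode` is hypothetical and exists only for illustration; it is not part of PxPlus.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode value as UTF-8 (illustrative sketch, 1 to 4 bytes)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("value is outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not valid characters")
    if code_point <= 0x7F:
        # 0xxxxxxx: the UTF-8 byte is the same as the Unicode value
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx: $C0$ Or'ed with the top 5 bits,
        # then $80$ Or'ed with the remaining 6 bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

In every multi-byte case the first byte is $C0$ or above and each following byte lies between $80$ and $BF$, which is what lets a reader of the data tell where a sequence starts.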
|||||||||||||||||||||
| Example | Assume
you want to display "Good Morning"
in Chinese: 早晨. These two characters have the
Unicode values 26089 and 26216 ($65E9$ and $6668$). To have the system generate a UTF-8 character string containing 早晨, you could use the CVS function:
Internally this would have the value of $E697A9$+$E699A8$. |
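As a cross-check of the byte values quoted above, the same conversion can be reproduced outside PxPlus. This is a small Python sketch using Python's built-in encoder; it does not show the CVS call itself.

```python
# Unicode values for the two characters, as given in the text
code_points = [26089, 26216]

# Build the string and encode it as UTF-8
text = "".join(chr(cp) for cp in code_points)
utf8_hex = text.encode("utf-8").hex().upper()

print(text)      # 早晨
print(utf8_hex)  # E697A9E699A8
```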
|||||||||||||||||||||
| System Parameter | The
'U8' System Parameter is used to control the UTF-8 logic in the
system. This parameter contains a series of
bits (flags) used to control the processing of UTF-8
data. If the 'U8' parameter is zero (all bits off, the default setting), no special processing is done for UTF-8 support. If it is non-zero, the following bits apply:
To enable UTF-8 logic, set the 'U8' system parameter to the sum of the option values above that you want to apply. |
|||||||||||||||||||||
| Separators | One
of the challenges in implementing UTF-8 encoding is that
the standard ProvideX field separator character ($8A$)
can occur within a UTF-8 data string as one of the trailing bytes of a multi-byte sequence. PxPlus handles this problem by only detecting the field separator when it is not processing a UTF-8 sequence. Internally, when looking for the field separator, the system skips over any UTF-8 encoded values. This resolves the problem, since every UTF-8 encoded sequence starts with a byte of hex $C0$ or above. |
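The skipping logic described above can be sketched as follows. This is an illustrative Python model, not the PxPlus implementation; the function name `split_fields` is hypothetical, and the sketch assumes the data is well-formed UTF-8 of one to four bytes per character.

```python
FIELD_SEP = 0x8A  # field separator byte named in the text


def split_fields(data: bytes) -> list:
    """Split on $8A$, but never inside a UTF-8 multi-byte sequence."""
    fields = []
    start = 0
    i = 0
    while i < len(data):
        byte = data[i]
        if byte >= 0xC0:
            # Lead byte of a UTF-8 sequence: skip the whole sequence,
            # so an $8A$ inside it is not treated as a separator.
            if byte >= 0xF0:
                i += 4
            elif byte >= 0xE0:
                i += 3
            else:
                i += 2
        elif byte == FIELD_SEP:
            fields.append(data[start:i])
            start = i + 1
            i += 1
        else:
            i += 1
    fields.append(data[start:])
    return fields


# U+010A encodes as C4 8A and U+628A as E6 8A 8A, so both contain $8A$.
first = chr(0x010A).encode("utf-8")
second = chr(0x628A).encode("utf-8")
record = first + bytes([FIELD_SEP]) + second

print(split_fields(record) == [first, second])   # True
print(len(record.split(bytes([FIELD_SEP]))))     # 5 with a naive split
```

A naive byte split breaks both characters apart, while the skipping scan returns the two fields intact.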
|||||||||||||||||||||
| Functions | OPT= for selective override
The functions LEN, MID, POS, UCS, LCS and CVS support an ,OPT="..." specification that can be used to override the default UTF-8 processing as defined above. If OPT="U" is added to the end of any of these functions, the system will treat the data being processed as UTF-8. If OPT="B" is present, the data will be treated as standard binary/ASCII values. For example:
OPT= has special meaning in the LEN function. If ,OPT="U" is passed to the LEN function, the input is considered to be UTF-8 and the system returns the length in actual characters rather than bytes. An ,OPT="B" is ignored in the LEN function. |
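The difference between a byte length and a character length can be illustrated with a short Python sketch. It is not PxPlus code and does not reproduce the LEN function; the helper name `utf8_char_count` is hypothetical.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters by counting every byte that is not a
    continuation byte (continuation bytes have the form 10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


data = "早晨".encode("utf-8")

print(len(data))              # 6 (length in bytes)
print(utf8_char_count(data))  # 2 (length in characters)
```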
|||||||||||||||||||||
| UTF-8 & Windows | When
the 'U8' parameter is enabled on Windows, PxPlus will
automatically start supporting Unicode input on all
controls and graphical output. The system will translate
all Unicode to UTF-8 format (and vice versa) in order to
allow the application to function normally. This includes
menus, multi_lines, list_boxes, drop_boxes, buttons,
check_boxes and radio_buttons, along with all graphical
mnemonics (except 'Caption'). In addition, the CLIP_BOARD READ and WRITE directives will convert the data to/from Unicode in order to preserve its contents.
Current Limitations
The following are the current limitations within the Windows implementation:
|
|||||||||||||||||||||
| Example | To
display a prompt on the screen containing "Enter
Product" in Chinese: the Chinese text for "Enter Product" in Unicode is 36664, 20837, 29986, 21697, or 輸入產品. To display it on the screen you could use:
Or to create a button:
These values can be stored in and retrieved from a file, so the Nomads screen designer can be used to draw the screen and enter the text. Simply make sure the 'U8' parameter is set. We strongly recommend that you set it in your START_UP program if you are planning to use it. |
|||||||||||||||||||||
| * Note * | To simplify conversion of the Euro currency symbol (€), which has historically been coded as $80$ in Windows, the UTF-8 converter automatically changes $80$ to $20AC$, its Unicode equivalent. This translation is done only when going from UTF-8 to Unicode, not vice versa, since applications should start using the proper Unicode values. | |||||||||||||||||||||
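One possible reading of the Euro translation in the note above is sketched below in Python. This is an assumption about the behaviour, not the PxPlus converter: the function name `utf8_to_code_points` is hypothetical, the input is assumed to be otherwise well-formed UTF-8, and a lone $80$ byte outside any multi-byte sequence is taken to be a legacy Windows Euro sign.

```python
EURO_LEGACY = 0x80     # Euro sign as historically coded in Windows
EURO_UNICODE = 0x20AC  # Euro sign in Unicode


def utf8_to_code_points(data: bytes) -> list:
    """Convert UTF-8 bytes to Unicode values, mapping a lone $80$ to $20AC$."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte indicates the length of the sequence
        if lead >= 0xF0:
            length = 4
        elif lead >= 0xE0:
            length = 3
        elif lead >= 0xC0:
            length = 2
        else:
            length = 1
        if length > 1:
            # Normal multi-byte sequence: decode it as UTF-8
            code_points.append(ord(data[i:i + length].decode("utf-8")))
        elif lead == EURO_LEGACY:
            # Assumed behaviour: a stray $80$ is a legacy Euro sign
            code_points.append(EURO_UNICODE)
        else:
            code_points.append(lead)
        i += length
    return code_points


legacy = b"\x8010"                 # legacy Euro byte followed by "10"
proper = "€10".encode("utf-8")     # proper UTF-8: E2 82 AC 31 30

print(utf8_to_code_points(legacy) == utf8_to_code_points(proper))  # True
```

Both inputs yield the same Unicode values, while nothing in the sketch ever turns $20AC$ back into $80$, matching the one-way translation described in the note.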