ZSTRING released in Eiffel-Loop 1.3.1

by Finnian Reilly (modified: 2016 Mar 17)

Preface

Since 2012 I have been interested in the idea of creating a memory efficient alternative to STRING_32 that does not sacrifice performance. In 2013 I published this article which describes and benchmarks the class EL_ASTRING (aka ASTRING). While ASTRING peformed well against STRING_32 and the Gobo UC_UTF8_STRING in terms of both memory consumption and performance, it had one fundamental problem which I felt would prevent it from ever gaining wide acceptance. This was that each ASTRING instance was limited to a maximum of 26 unicode characters that could be used to augment the base latin character set.

In the fall of 2015 I came up with a better algorithm which did not suffer the limitations of ASTRING. Although the new design was somewhat less memory efficient for storing characters wider than 8-bits, it performed well overall. I decided to call it ZSTRING. In November 2015 I announced work in progress on ZSTRING in this article.

I am happy to announce that this project is now completed and since January I have migrated all my projects to use ZSTRING. ZSTRING is now available to download as part of the latest 1.3.1 release of Eiffel-Loop.

Finnian Reilly

What is ZSTRING?

ZSTRING is a memory efficient alternative to the standard STRING_32 class. If your application processes text that is mainly written in a language that is encodeable by a latin character set, you can expect up to a 75% saving in memory. There is a small trade off in performance, but not by much as the benchmarks show. Actually there is not always a trade off. Some ZSTRING string operations are slightly faster than their STRING_32 equivalent.

How is ZSTRING different from STRING_8?

With STRING_8 you are limited to characters from the ISO-8859-1 character set. A ZSTRING instance can be initialized from any STRING_32 string and you are guaranteed to always get back the same STRING_32 string with a call to the function {ZSTRING}.as_string_32. You can also convert it to UTF-8 with a call to {ZSTRING}.to_utf_8.

How ZSTRING Works

ZSTRING uses a combination of two different character arrays to represent the text. The first array is defined as:

area: SPECIAL [CHARACTER_8]

When initialized from a STRING_32, area holds every character that is encodeable using a system defined latin character-set (ISO-8859-15 by default). Any unicode character that is not encodeable as latin, is denoted by the control code 26 'SUB' character. SUB stands for substitution.

The unencoded array

The second array is defined as:

unencoded_area: SPECIAL [NATURAL_32]

This array is used to store any characters that are not encodeable by the latin codec. The array is structured as a sequence of substrings. The first two cells of each substring record the indices defining the substring interval. As an example to illustrate, consider the string:

"#61 ZhÅng FÃº ä¸åš Inner Truth"

It contains a total of 27 characters. 24 of these are encodeable as Latin-15 and 3 are not. The three that are not encodeable are stored in the unencoded_area array as shown in the following diagram:

The olive colored cells store the interval indices of the 2 substrings "Å" and "ä¸åš".

The Default Codec

By default, ZSTRING uses the codec class EL_ISO_8859_15_ZCODEC to convert unicode strings. However you can override this choice by calling the routine:

{EL_SHARED_ZCODEC}.set_system_codec

For a desktop application you could set the system codec according to the user locale. The following class offers a convenient way to create codec instances from an integer arguement:

class EL_SHARED_ZCODEC_FACTORY inherit EL_FACTORY_CLIENT feature {NONE} -- Factory new_iso_8859_codec (id: INTEGER): EL_ISO_8859_ZCODEC require not_iso_8859_12: id /= 12 valid_id: (1 |..| 15).has (id) do Result := ISO_8859_factory.instance_from_type (ISO_8859_codec [id], agent {EL_ISO_8859_ZCODEC}.make) end

The ZSTRING Test Suite

ZSTRING has a comprehensive test suite that verifies the results of all string operations against STRING_32.

See ZSTRING_TEST_SET also TEST_STRING_CONSTANTS

ZSTRING for Asian text

If your application is mainly processing Asian text, it is obvious that the current implementation of ZSTRING will incur a significant penalty in terms of both memory and runtime performance. The solution to this problem is to use a different source compatible implementation of ZSTRING adapted for Asian text that can be set with an ECF option.

A 2-byte Solution

The solution is in inherent in the fact that most Asian characters will fit into 2 bytes. By changing the storage arrays to the following definition it should be possible to create an efficient ZSTRING for Asian text

area: SPECIAL [NATURAL_16] unencoded_area: SPECIAL [NATURAL_32] Now the criteria for whether the character is stored in the unencoded_area is if it does not fit into 2-bytes. As it is unlikely that you will have many consecutive runs of 3 or 4-byte characters, the current scheme of using 2 cells for interval onsets and offsets makes no sense. Instead we should assume that over-sized characters occur in isolated instances and we only need to store the index. However some research is required to verify these assumptions.

Ideally area should be defined as CHARACTER_16, but as this is not part of standard Eiffel, NATURAL_16 can be used as a workaround.

Perhaps in the future some Asian coder will be interested to take on the challenge of implementing this.

A 4-byte Solution

Of course an easier option is to create a source compatible version of ZSTRING that uses the same implementation as STRING_32.

Benchmark Notes

You can find a set of benchmarks at the following URL comparing ZSTRING and STRING_32 in terms of memory consumption and runtime performance.

http://www.eiffel-loop.com/benchmarks/ZSTRING-benchmarks-latin-15.html

The Test Data

The test data used in these benchmarks is shown at the bottom of the benchmark page. The data is composed of 64 rows and 4 columns referenced as A, B, C, D. The Input column indicates which combination of columns have been selected for each test. For example "$A $D" means that each input string is a string from column A joined with a string from column D. Each runtime performance test is repeated for each of the 64 rows.

The columns have different storage characteristics as follows:

Columns A and D consist solely of characters that can be encoded as Latin-15. Column B is composed of characters, some of which will need to be stored in the unencoded_area array. Column C is composed of characters, all of which will need to be stored in the unencoded_area array.

Memory Consumption

Table 1 shows memory consumption under optimal conditions, where all the text data can be encoded as Latin-1.

Table 2 shows memory consumption under sub-optimal conditions, where some proportion of the text must be stored in the unencoded_area array.

Runtime Performance

Runtime performance is benchmarked using the 25 most common string operations. The performance figures are the average of 1000 runs for each benchmark test operation. The results are sorted in order of descending ZSTRING performance rank. Table 3 shows runtime performance under optimal conditions, where all the text data can be encoded as Latin-1. Table 4 shows performance under sub-optimal conditions, where some proportion of the text must be stored in the unencoded_area array. Table 4 has more input permutations than table 3.

I Ching Hexagram Test Strings

Table 5 at the bottom of the page shows the data used for the benchmarks. The data are names and descriptions of the 64 I Ching hexagrams divided into 4 columns as follows: A. Hexagram number B. Chinese names as pinyin phonetic spelling C. Chinese names as chinese characters D. English description

ZSTRING argument options

Most ZSTRING routines have two alternate forms as follows:

taking arguments of type EL_READABLE_ZSTRING
taking arguments of type READABLE_STRING_GENERAL

For example:append_string append_string_general

Extra ZSTRING routines

ZSTRING has a number of extra routines not found in the standard Eiffel string classes.

substituted_tuple alias "#$" (inserts: TUPLE): like Current

This routine is inspired by the Python string formatting or interpolation operator '%'. It is not quite as sophisticated as the Python operator as it cannot do numeric formatting, but nevertheless is very useful. It means any ZSTRING can be used as an interpolation template with the addition of '%S' markers (AKA '#').

translate (old_characters, new_characters: EL_READABLE_ZSTRING)

This routine is inspired by the Python method string.translate(s, table[, deletechars]). ZSTRING has a number of different variations of this routine one of which is able to do character deletion.

substring_index_list (str: EL_READABLE_ZSTRING): like internal_substring_index_list

This returns a list of substring indices as an arrayed list.

substring_intervals (other: EL_READABLE_ZSTRING): EL_SEQUENTIAL_INTERVALS

The class EL_SEQUENTIAL_INTERVALS is a high performance alternative to using ARRAYED_LIST [INTEGER_INTERVAL] both in terms of memory consumption and runtime performance.

escaped (escaper: EL_CHARACTER_ESCAPER [EL_ZSTRING]): like Current

Returns an escaped copy of current string

unescape (escape_table: EL_ESCAPE_TABLE)

Unescapes the current string

Source Code Links

class EL_ZSTRING aka ZSTRING

class EL_READABLE_ZSTRING

class EL_UNENCODED_CHARACTERS

Comments

Berend de Boer (9 years ago 16/3/2016)
Are you aware of Gobo's UC_STRING? That's basically UTF-8 (there is also a UTF-16 variant), and that has been a very stable solution for me for many years.
- Finnian Reilly (9 years ago 17/3/2016)
  gobo class UC_UTF8_STRING
  
  hi Berend
  
  I published this article in 2013 which benchmarks UC_UTF8_STRING against STRING_32 and ASTRING (a precursor to ZSTRING). These benchmarks show that UC_UTF8_STRING is extremely slow for some commonly used operations. I don't recall now the UTF-16 variant or perhaps it's something new.
  
  UC_UTF8_STRING Benchmark Screenshot
- Finnian Reilly (9 years ago 17/3/2016)
  UC_UTF8_STRING is broken
  
  Besides being very slow, UC_UTF8_STRING is also broken, as this test shows. (I am using the version distributed with EiffelStudio 15.01.9)
  
  class TEST_UC_UTF8_STRING inherit EQA_TEST_SET feature -- Test routines test_replace_substring_all note testing: "covers/{UC_UTF8_STRING}.replace_substring_all" local str_32, original_32, new_32: STRING_32 str_utf8, original_utf8, new_utf8: UC_UTF8_STRING readable_str_utf8: READABLE_STRING_GENERAL do create str_32.make_from_string (Hex_10_pinyin) create str_utf8.make_from_string_general (Hex_10_pinyin) create original_32.make_empty original_32.append_code (Hex_10_pinyin.code (3)) create original_utf8.make_empty original_utf8.append_item_code (Hex_10_pinyin.code (3).to_integer_32) create new_32.make_from_string (Hex_10_chinese) create new_utf8.make_from_string_general (Hex_10_chinese) str_32.replace_substring_all (original_32, new_32) str_utf8.replace_substring_all (original_utf8, new_utf8) readable_str_utf8 := str_utf8 -- Workaround for `to_string_32' visibility issue assert ("same result", readable_str_utf8.to_string_32 ~ str_32) end feature {NONE} -- Constants Hex_10_chinese: STRING_32 = "å°ç•œ" Hex_10_pinyin: STRING_32 = "XiÇŽo ChÃ¹" end
  - Eric Bezault (9 years ago 22/3/2016)
    UC_UTF8_STRING fixed
    
    Thank you for reporting this problem. It's now fixed in the Git repository of Gobo in Github.