Convert the text package to use UTF–8 internally

Jasper Van der Jeugt, April 6 2011, as submitted to the Google Summer of Code 2011 program

Introduction

The default String type is easy to use and can be used for prototype applications, but it lacks the performance needed for real-world applications. ByteString 1 types offer great performance, but because they work with raw bytes, they are often awkward to use when the user wants text processing.

This is where the excellent Text 2 type, which offers a combination of speed and simplicity-of-use, comes in. For these reasons (among others), many applications and libraries already use the Text library intensively. This means that performance improvements in the Text library can have a large impact on many Haskell projects.

Initially, the Text library was written as part of Tom Harper’s master >dissertation at the University of Oxford 3. At this time, benchmarking seemed to indicate that UTF–16 would be a good choice for the internal encoding used in the library, because it provides a nice balance between UTF–8, which is compact but uses a more complicated encoding/decoding algorithm, and UTF–32, which uses a lot more memory but has a straightforward encoding/decoding algorithm.

Several benchmarks were published to support the Text library. However, they only timed the performance of the different encodings after the Text had been decoded and read from a socket/file/… Additionally, they failed to take the time used to encode and write the Text back into account.

If benchmarks would take this into account, the results might look different: since a lot of “Real World” is UTF–8 encoded, the initial conversion — read some bytes to a Text data structure — and the final conversion — write the Text back to a file — could be done very fast if Text would use UTF–8 internally as well.

Expected advantages and disadvantages of using UTF–8

As is to be expected, there are both advantages and disadvantages to every representation, including UTF–8. A list of the most important ones:

Advantages:

Disadvantages:

Tasks

There are a number of different tasks in this project, with some dependencies between them. A brief overview of them:

1. Benchmarks

I will collect and extend the benchmark suite for the Text library using two strategies:

Following good evaluation practices, all benchmarks should uniformly use the Criterion 5 library.

An important goal is to have a good coverage of different languages in the various benchmarks. This requires including character sets such as Cyrillic and Arabic in addition to the classic western (i.e., Latin) character set.

2. Conversion

The actual encoding/decoding functionality is located in the Data.Text.Fusion module. It can be converted to UTF–8 in a reasonably small amount of time. However, I anticipate the following possible problems (since needs to be done right):

3. Evaluation

Once the benchmarks have been implemented to my satisfaction, they will be used to compare the UTF–8 and UTF–16 implementations of the Text library. Based on rigorous experiments, I will employ the different profiling tools/techniques available for Haskell (cost-center profiling, heap profiles, studying generated core…) to allow identifying potential bottlenecks and areas where further optimization is possible.

Schedule

Note that the tasks as described above will not be completed following a sequential scheme, but in short iterative cycles. This means that, once enough benchmarks have been collected, my work-flow will consist of:

Because of this iterative schedule, the concrete planning is not very detailed, but I can give some deadlines:

I also plan to regularly publish posts on my blog 6 — say one post every 14 days — about my progress on the project, and discuss interesting problems/obstacles I come across.

Conclusion

I think I make an excellent candidate to implement the above proposal, because I have: