Jasper Van der Jeugt, April 6 2011, as submitted to the Google Summer of Code 2011 program
The default String type is easy to use and fine for prototype applications, but it lacks the performance needed for real-world applications. The ByteString types [1] offer great performance, but because they work with raw bytes, they are often awkward to use when the user wants text processing. This is where the excellent Text type [2] comes in: it offers a combination of speed and simplicity of use. For these reasons (among others), many applications and libraries already use the Text library intensively. This means that performance improvements in the Text library can have a large impact on many Haskell projects.
Initially, the Text library was written as part of Tom Harper's master's dissertation at the University of Oxford [3]. At that time, benchmarking seemed to indicate that UTF–16 would be a good choice for the internal encoding used in the library, because it provides a nice balance between UTF–8, which is compact but uses a more complicated encoding/decoding algorithm, and UTF–32, which has a straightforward encoding/decoding algorithm but uses a lot more memory.
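The size trade-off between the three encodings can be illustrated with a small sketch. This assumes the text and bytestring packages are installed; the sample strings are arbitrary examples chosen to show both extremes:

```haskell
-- Sketch: how many bytes each encoding needs for the same string.
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE, encodeUtf32LE)

main :: IO ()
main = do
    let ascii = T.pack "hello"   -- ASCII only
        cjk   = T.pack "日本語"  -- East Asian characters
    print (B.length (encodeUtf8    ascii))  -- 1 byte per character
    print (B.length (encodeUtf16LE ascii))  -- 2 bytes per character
    print (B.length (encodeUtf32LE ascii))  -- 4 bytes per character
    print (B.length (encodeUtf8    cjk))    -- 3 bytes per character
    print (B.length (encodeUtf16LE cjk))    -- 2 bytes per character
```

For ASCII-heavy input, UTF–8 halves the memory use compared to UTF–16; for East Asian text the relation reverses.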
Several benchmarks were published to support the Text library. However, they only timed the performance of the different encodings after the Text had been decoded and read from a socket/file/… Additionally, they failed to take into account the time used to encode the Text and write it back.
If benchmarks took this into account, the results might look different: since a lot of "Real World" data is UTF–8 encoded, the initial conversion (reading bytes into a Text data structure) and the final conversion (writing the Text back to a file) could be done very fast if Text used UTF–8 internally as well.
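The end-to-end pipeline that such benchmarks should measure can be sketched as follows. This is a minimal example assuming the text and bytestring packages; T.toUpper is a stand-in for whatever processing a real application performs:

```haskell
-- Sketch: decode incoming UTF-8 bytes, process the Text, encode back.
-- With an internal UTF-16 representation both conversion steps must
-- transcode every byte; with an internal UTF-8 representation the
-- decode step could be reduced to validation and the encode step to
-- handing back the existing buffer.
import qualified Data.ByteString as B
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import qualified Data.Text as T

main :: IO ()
main = B.interact (encodeUtf8 . T.toUpper . decodeUtf8)
```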
As is to be expected, there are both advantages and disadvantages to every representation, including UTF–8. A list of the most important ones:
Advantages:
Reduced memory usage (except for texts with, for example, many East Asian characters). Under UTF–8, ASCII characters only take up a single byte, which is a very good property because a lot of markup languages use such characters in their syntax (e.g., HTML uses ASCII tag names). This can be crucial for programs that need to hold a lot of text in memory (e.g., data mining applications).
UTF–8 is widely used in "Real World" applications, such as the World Wide Web [4]. For these applications, using UTF–8 as the internal Text representation means that initial and final conversions can be done in O(1) time.
Disadvantages:
UTF–8 requires a more complicated encoding/decoding algorithm with more branching than UTF–16, which makes the core loops of the Text library harder to optimize.

There are a number of different tasks in this project, with some dependencies between them. A brief overview of them:
I will collect and extend the benchmark suite for the Text library using two strategies:
Talk to the community using the Haskell-cafe mailing list, asking for open-source applications that make extensive use of the Text library. For this purpose, I can also use the reverse dependencies on Hackage (the Haskell package repository). Most of these benchmarks will require editing to allow easy interoperability with the different Text versions.
I plan to implement several benchmarks myself, modeling behavior of real applications. This is important because artificial benchmarks do not reflect the ways in which the library code will be used in a real-world application and the ways in which the generated machine code interacts with other components of these applications.
Following good evaluation practices, all benchmarks should uniformly use the Criterion [5] library.
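A minimal sketch of what such a Criterion benchmark could look like (assuming the criterion, text and bytestring packages are installed; the sample input and benchmark names are made up for illustration):

```haskell
-- Sketch: the common shape every benchmark in the suite could share,
-- so results stay comparable between the UTF-8 and UTF-16 variants.
import Criterion.Main (defaultMain, bgroup, bench, nf)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

main :: IO ()
main = do
    -- hypothetical sample input; a real benchmark would read a corpus file
    let bytes = encodeUtf8 (T.pack (concat (replicate 1000 "Hello, wörld! ")))
    defaultMain
        [ bgroup "utf8"
            [ bench "decode-encode roundtrip" $
                nf (encodeUtf8 . decodeUtf8) bytes
            ]
        ]
```

Criterion handles the statistically sound timing (bootstrapping, outlier detection), so individual benchmarks only need to supply a function and its input.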
An important goal is to have a good coverage of different languages in the various benchmarks. This requires including character sets such as Cyrillic and Arabic in addition to the classic western (i.e., Latin) character set.
The actual encoding/decoding functionality is located in the Data.Text.Fusion module. Converting this module to UTF–8 should take a reasonably small amount of time. However, since this needs to be done right, I anticipate the following possible problems:
Branching is much more complicated for UTF–8, compared to UTF–16. When Tom Harper initially wrote the Text library, this complexity was given as one of the primary reasons for choosing UTF–16.
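The difference in branching can be sketched with two toy functions that compute the length of a single encoded sequence. This is an illustration using only the standard library, not the Text library's actual decoder:

```haskell
-- Sketch: deciding how many units a code point occupies requires a
-- four-way branch on the lead byte in UTF-8, but only a single
-- surrogate check in UTF-16.
import Data.Word (Word8, Word16)

-- Number of bytes in the UTF-8 sequence starting with this lead byte.
utf8SeqLength :: Word8 -> Int
utf8SeqLength b
    | b < 0x80  = 1  -- ASCII
    | b < 0xC0  = error "continuation byte, not a lead byte"
    | b < 0xE0  = 2
    | b < 0xF0  = 3
    | otherwise = 4

-- Number of 16-bit units in the UTF-16 sequence starting with this unit.
utf16SeqLength :: Word16 -> Int
utf16SeqLength w
    | w >= 0xD800 && w < 0xDC00 = 2  -- high surrogate: pair follows
    | otherwise                 = 1

main :: IO ()
main = do
    print (map utf8SeqLength  [0x24, 0xC2, 0xE2, 0xF0])
    print (map utf16SeqLength [0x0024, 0xD83D])
```

The real decoder must additionally validate continuation bytes and reject overlong sequences, which adds yet more branches to the UTF–8 inner loop.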
Micro-optimization is incredibly important in this module, because it is the very core of the Text library. To optimize it as far as possible, I will have to look at the generated GHC Core and figure out where further optimizations are possible.
Once the benchmarks have been implemented to my satisfaction, they will be used to compare the UTF–8 and UTF–16 implementations of the Text library. Based on rigorous experiments, I will employ the different profiling tools/techniques available for Haskell (cost-center profiling, heap profiles, studying the generated Core…) to identify potential bottlenecks and areas where further optimization is possible.
Note that the tasks described above will not be completed following a sequential scheme, but in short iterative cycles. This means that, once enough benchmarks have been collected, my work-flow will consist of profiling the library, optimizing the bottlenecks found, and re-running the benchmarks to verify the improvements.
Because of this iterative schedule, the concrete planning is not very detailed, but I can give some deadlines:
I also plan to regularly publish posts on my blog [6] (say, one post every 14 days) about my progress on the project, and to discuss interesting problems/obstacles I come across.
I think I make an excellent candidate to implement the above proposal, because I have: