Jasper Van der Jeugt, April 6 2011, as submitted to the Google Summer of Code 2011 program
The default String type is easy to use and fine for prototype applications, but it lacks the performance needed for real-world applications. The ByteString types [1] offer great performance, but because they work with raw bytes, they are often awkward to use when the user wants text processing. This is where the excellent Text type [2] comes in: it offers a combination of speed and simplicity of use. For these reasons (among others), many applications and libraries already use the Text library intensively. This means that performance improvements in the Text library can have a large impact on many Haskell projects.
Initially, the Text library was written as part of Tom Harper's master's dissertation at the University of Oxford [3]. At that time, benchmarking seemed to indicate that UTF–16 would be a good choice for the internal encoding used in the library, because it provides a nice balance between UTF–8, which is compact but uses a more complicated encoding/decoding algorithm, and UTF–32, which has a straightforward encoding/decoding algorithm but uses a lot more memory.
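The size trade-off between the three encodings can be illustrated with a small sketch. This assumes the text and bytestring packages are installed; the sample strings are arbitrary examples chosen to show both extremes:

```haskell
-- Sketch: how many bytes each encoding needs for the same string.
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE, encodeUtf32LE)

main :: IO ()
main = do
    let ascii = T.pack "hello"   -- ASCII only
        cjk   = T.pack "日本語"  -- East Asian characters
    print (B.length (encodeUtf8    ascii))  -- 1 byte per character
    print (B.length (encodeUtf16LE ascii))  -- 2 bytes per character
    print (B.length (encodeUtf32LE ascii))  -- 4 bytes per character
    print (B.length (encodeUtf8    cjk))    -- 3 bytes per character
    print (B.length (encodeUtf16LE cjk))    -- 2 bytes per character
```

For ASCII-heavy input, UTF–8 halves the memory use compared to UTF–16; for East Asian text the relation reverses.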
Several benchmarks were published to support the Text library. However, they only timed the performance of the different encodings after the Text had been decoded and read from a socket/file/… Additionally, they failed to take into account the time used to encode the Text and write it back.
If benchmarks took this into account, the results might look different: since a lot of "Real World" data is UTF–8 encoded, the initial conversion (reading bytes into a Text data structure) and the final conversion (writing the Text back to a file) could be done very fast if Text used UTF–8 internally as well.
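The end-to-end pipeline that such benchmarks should measure can be sketched as follows. This is a minimal example assuming the text and bytestring packages; T.toUpper is a stand-in for whatever processing a real application performs:

```haskell
-- Sketch: decode incoming UTF-8 bytes, process the Text, encode back.
-- With an internal UTF-16 representation both conversion steps must
-- transcode every byte; with an internal UTF-8 representation the
-- decode step could be reduced to validation and the encode step to
-- handing back the existing buffer.
import qualified Data.ByteString as B
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import qualified Data.Text as T

main :: IO ()
main = B.interact (encodeUtf8 . T.toUpper . decodeUtf8)
```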
As is to be expected, there are both advantages and disadvantages to every representation, including UTF–8. A list of the most important ones:
Advantages:
Reduced memory usage (except for texts with, for example, many East Asian characters). Under UTF–8, ASCII characters only take up a single byte, which is a very good property because a lot of markup languages use such characters in their syntax (e.g., HTML uses ASCII tag names). This can be crucial for programs that need to hold a lot of text in memory (e.g., data mining applications).
UTF–8 is widely used in "Real World" applications, such as the World Wide Web [4]. For these applications, using UTF–8 as the internal Text representation means that initial and final conversions can be done in O(1) time.
Disadvantages:
UTF–8 requires a more complicated encoding/decoding algorithm with more branching than UTF–16, which makes the core loops of the Text library harder to optimize.

There are a number of different tasks in this project, with some dependencies between them. A brief overview of them:
I will collect and extend the benchmark suite for the Text library using two strategies:
Talk to the community using the Haskell-cafe mailing list, asking for open-source applications that make extensive use of the Text library. For this purpose, I can also use the reverse dependencies on Hackage (the Haskell package repository). Most of these benchmarks will require editing to allow easy interoperability with the different Text versions.
I plan to implement several benchmarks myself, modeling behavior of real applications. This is important because artificial benchmarks do not reflect the ways in which the library code will be used in a real-world application and the ways in which the generated machine code interacts with other components of these applications.
Following good evaluation practices, all benchmarks should uniformly use the Criterion [5] library.
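A minimal sketch of what such a Criterion benchmark could look like (assuming the criterion, text and bytestring packages are installed; the sample input and benchmark names are made up for illustration):

```haskell
-- Sketch: the common shape every benchmark in the suite could share,
-- so results stay comparable between the UTF-8 and UTF-16 variants.
import Criterion.Main (defaultMain, bgroup, bench, nf)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

main :: IO ()
main = do
    -- hypothetical sample input; a real benchmark would read a corpus file
    let bytes = encodeUtf8 (T.pack (concat (replicate 1000 "Hello, wörld! ")))
    defaultMain
        [ bgroup "utf8"
            [ bench "decode-encode roundtrip" $
                nf (encodeUtf8 . decodeUtf8) bytes
            ]
        ]
```

Criterion handles the statistically sound timing (bootstrapping, outlier detection), so individual benchmarks only need to supply a function and its input.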
An important goal is to have a good coverage of different languages in the various benchmarks. This requires including character sets such as Cyrillic and Arabic in addition to the classic western (i.e., Latin) character set.
The actual encoding/decoding functionality is located in the Data.Text.Fusion module. Converting this module to UTF–8 should take a reasonably small amount of time. However, since this needs to be done right, I anticipate the following possible problems:
Branching is much more complicated for UTF–8, compared to UTF–16. When Tom Harper initially wrote the Text library, this complexity was given as one of the primary reasons for choosing UTF–16.
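The difference in branching can be sketched with two toy functions that compute the length of a single encoded sequence. This is an illustration using only the standard library, not the Text library's actual decoder:

```haskell
-- Sketch: deciding how many units a code point occupies requires a
-- four-way branch on the lead byte in UTF-8, but only a single
-- surrogate check in UTF-16.
import Data.Word (Word8, Word16)

-- Number of bytes in the UTF-8 sequence starting with this lead byte.
utf8SeqLength :: Word8 -> Int
utf8SeqLength b
    | b < 0x80  = 1  -- ASCII
    | b < 0xC0  = error "continuation byte, not a lead byte"
    | b < 0xE0  = 2
    | b < 0xF0  = 3
    | otherwise = 4

-- Number of 16-bit units in the UTF-16 sequence starting with this unit.
utf16SeqLength :: Word16 -> Int
utf16SeqLength w
    | w >= 0xD800 && w < 0xDC00 = 2  -- high surrogate: pair follows
    | otherwise                 = 1

main :: IO ()
main = do
    print (map utf8SeqLength  [0x24, 0xC2, 0xE2, 0xF0])
    print (map utf16SeqLength [0x0024, 0xD83D])
```

The real decoder must additionally validate continuation bytes and reject overlong sequences, which adds yet more branches to the UTF–8 inner loop.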
Micro-optimization is incredibly important in this module, because it is the very core of the Text library. To optimize it as far as possible, I will have to look at the generated GHC Core and figure out where further optimizations are possible.
Once the benchmarks have been implemented to my satisfaction, they will be used to compare the UTF–8 and UTF–16 implementations of the Text library. Based on rigorous experiments, I will employ the different profiling tools/techniques available for Haskell (cost-center profiling, heap profiles, studying the generated Core…) to identify potential bottlenecks and areas where further optimization is possible.
Note that the tasks described above will not be completed following a sequential scheme, but in short iterative cycles. This means that, once enough benchmarks have been collected, my work-flow will consist of profiling the library, optimizing the bottlenecks found, and re-running the benchmarks to verify the improvements.
Because of this iterative schedule, the concrete planning is not very detailed, but I can give some deadlines:
I also plan to regularly publish posts on my blog [6] (say, one post every 14 days) about my progress on the project, and to discuss interesting problems/obstacles I come across.
I think I make an excellent candidate to implement the above proposal, because I have: