Text/UTF-8: Initial results
Published on July 10, 2011 under the tag haskell
What is this?
I’ve been working on the Data.Text library for my Google Summer of Code project under the wise supervision of Edward Kmett (but to be fair, Johan Tibell has also been helping me a lot, for which I am very grateful as well).
Some more context
Porting the Text library to use UTF-8 as internal encoding (instead of the currently used UTF-16) is a rather tricky task because there isn’t “one” representative benchmark: the results depend on:
- The nature of the program: read once and serve statically, vs. do lots of in-memory manipulations, vs. read, manipulate, write…
- The language used: for example, ASCII text takes up half the number of bytes in UTF-8 compared to UTF-16, whereas some East-Asian characters take up three bytes instead of two (a short sketch follows this list).
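To make the second point concrete, here is a minimal sketch (assuming the existing Data.Text.Encoding API, which provides both encodeUtf8 and encodeUtf16LE) that compares the encoded sizes of a few strings:

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

-- Byte counts of the same text under both encodings.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t = (B.length (T.encodeUtf8 t), B.length (T.encodeUtf16LE t))

main :: IO ()
main = do
    print $ encodedSizes $ T.pack "hello"  -- (5, 10): ASCII needs half the bytes in UTF-8
    print $ encodedSizes $ T.pack "日本語"  -- (9, 6): three bytes per character instead of two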
This means that when a change is made, not all benchmark results move in the same direction. Hence, we aim either to make the benchmarks faster, or to keep them as fast as they currently are while using less memory.
On to the results!
For the record, the test files used are an ASCII text file of 58M and a Russian file of 62M. Note that while they are (more or less) the same size, the ASCII text is a lot longer when counting characters instead of bytes (about twice as long).
Uppercasing
The first program I benchmarked for this blogpost is a simple one which converts a file to uppercase. Note that the input and output need to happen in UTF-8 (which is usually the case), but that this doesn’t have as big an impact as you might initially think: by using Stream Fusion, the UTF-16 version will also read the input from the ByteString directly.
module Main where
import System.Environment (getArgs)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
main :: IO ()
main = do
    (filePath : _) <- getArgs
    text <- fmap T.decodeUtf8 $ B.readFile filePath
    B.putStr $ T.encodeUtf8 $ T.toUpper text
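As an aside (not one of the benchmarks), for very large files one could also write a streaming variant using the lazy Text API; this sketch assumes the standard Data.Text.Lazy modules:

module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL

main :: IO ()
main = do
    (filePath : _) <- getArgs
    -- Lazy IO processes the file chunk by chunk, so memory use stays
    -- roughly constant regardless of the file size.
    text <- fmap TL.decodeUtf8 $ BL.readFile filePath
    BL.putStr $ TL.encodeUtf8 $ TL.toUpper text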
Word frequencies
Because Stream Fusion plays such an important role in the first benchmark, Johan suggested I use another benchmark in which the Text values are used as keys in a Map: that way, they are forced to be reified as a UTF-8 encoded array. Word frequencies is a simple example of such a program; the next benchmark finds the most used word in a text (“the” for the English ASCII text and “и” for the Russian text).
import Data.List (foldl', maximumBy)
import Data.Map (Map)
import Data.Ord (comparing)
import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.Text.IO as T
frequencies :: Ord a => [a] -> Map a Int
frequencies = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty

top :: Map a Int -> a
top = fst . maximumBy (comparing snd) . M.toList

main :: IO ()
main = do
    (filePath : _) <- getArgs
    text <- fmap T.decodeUtf8 $ B.readFile filePath
    T.putStrLn $ top $ frequencies $ T.words $ T.toLower text
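A small aside on the frequencies function: since Data.Map is lazy in its values, the counts are built up as chains of (+) thunks. A sketch of a stricter version, assuming a containers version that ships Data.Map.Strict (whose insertWith forces the values on insertion):

import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Same fold as above, but the strict insertWith evaluates each
-- count as it is updated, so no thunks accumulate in the map.
frequencies :: Ord a => [a] -> M.Map a Int
frequencies = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty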
The results for this benchmark are very similar for both UTF-8 and UTF-16. This is reasonable because I wrote a new Ord instance for Text which performs better for medium-to-long Text values (e.g. more than 32 characters) but slightly worse for small Text values, because there is the overhead of an extra function call. This Ord instance is heavily used in the above benchmark because of the Map container, but these results indicate that there is no real degradation here, so I can rejoice.
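A pleasant property of UTF-8 that makes fast byte-wise comparison possible in the first place: lexicographic comparison of the encoded bytes agrees with lexicographic comparison by code point. A QuickCheck-style property sketch (assuming QuickCheck, and inputs without surrogate code points):

import qualified Data.Text as T
import qualified Data.Text.Encoding as T

-- Comparing the UTF-8 encodings byte by byte gives the same result
-- as comparing the original strings code point by code point.
prop_utf8PreservesOrder :: String -> String -> Bool
prop_utf8PreservesOrder a b =
    compare (T.encodeUtf8 (T.pack a)) (T.encodeUtf8 (T.pack b))
        == compare a b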
What’s next
There’s still lots of work to be done. In the next blogpost, I hope to talk about reduced memory usage and its implications.