BlazeHtml RFC


Introduction

BlazeHtml started out at ZuriHac 2010. Now, Jasper Van der Jeugt is working on it as a Google Summer of Code student for haskell.org. His mentors are Simon Meier and Johan Tibell. The goal is to create a high-performance HTML generation library.

In the past few weeks, we have been exploring the performance and design of different drafts of this library. Now, the time has come to ask the Haskell community some questions, and more specifically the future users of BlazeHtml as well as current users of other HTML generation libraries.

About this file

This document is a literate Haskell file. Here is a plain version of the document. It serves two purposes: (1) it explains our current ideas and (2) it asks you, as the reader, for feedback. If you want to run this file or experiment with the code, you need to check out the code from GitHub:

git clone git://github.com/jaspervdj/BlazeHtml.git

Enter the newly created directory BlazeHtml using

cd BlazeHtml

and load this document using

ghci doc/RFC.lhs

Note that we placed a .ghci file in the BlazeHtml directory. It sets the correct include directories for ghci.

Notational preliminaries

A “string” is a sequence of Unicode codepoints. A value of type String is a concrete representation of a string; i.e. a Haskell list of Unicode codepoints. A value of type Text is another concrete representation of a string provided by the Data.Text library. “Encoding” a string means converting the sequence of Unicode codepoints to a sequence of bytes, using a format like UTF-8 or UTF-16.
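To make the distinction between a string and its encoding concrete, here is a small sketch using the text and bytestring packages. The names sample, asText, utf8Bytes and utf16Bytes are ours, chosen for illustration; they are not part of BlazeHtml.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The same four codepoints ("café") as a String ...
sample :: String
sample = "caf\233"

-- ... and as a Text value.
asText :: T.Text
asText = T.pack sample

-- Encoding turns codepoints into bytes; the result depends on the format.
utf8Bytes :: B.ByteString
utf8Bytes = TE.encodeUtf8 asText      -- 'é' takes two bytes here

utf16Bytes :: B.ByteString
utf16Bytes = TE.encodeUtf16LE asText  -- each of these codepoints takes two bytes
```

Comparing B.length utf8Bytes with B.length utf16Bytes in ghci shows that one string can have several byte-level encodings of different sizes.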

An “HTML document” is a tree whose nodes are HTML elements with string-valued attributes and whose leaves are strings. “Rendering” an HTML document means converting it to a string that will result in the same HTML tree when parsed by an HTML parser.
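The definition of a document tree and of rendering can be sketched as follows. This is a toy model for illustration only: the type Node and the functions escape and render are our own names and do not reflect how BlazeHtml represents documents internally.

```haskell
-- A toy model of the definition above; not BlazeHtml's representation.
data Node
    = Element String [(String, String)] [Node]  -- tag, attributes, children
    | Leaf String                               -- text content

-- Escape the characters that would otherwise be parsed as markup.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc '"' = "&quot;"
    esc c   = [c]

-- Render the tree to a string that parses back to the same tree.
render :: Node -> String
render (Leaf s) = escape s
render (Element tag attrs children) =
    "<" ++ tag ++ concatMap attr attrs ++ ">"
        ++ concatMap render children
        ++ "</" ++ tag ++ ">"
  where
    attr (k, v) = " " ++ k ++ "=\"" ++ escape v ++ "\""
```

The sketch ignores details such as elements without a closing tag (for example link); it only illustrates why leaves and attribute values must be escaped for the parsed result to match the original tree.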

Problem definition

The goal of the BlazeHtml project is to create a light-weight Haskell combinator language for HTML documents that can be rendered as efficiently as possible.

Supported string representations

Obviously, we need to fix the concrete string representations to be used for describing attributes and leaves of HTML documents. We have chosen to support both String values and Text values, as we assume that these are the most common representations for strings occurring in user code.

Q1: Are there other string representations that should be supported natively; i.e. without converting them to Data.Text or String first?

Note that we enable the OverloadedStrings language extension to also support string literals of type Text.

> {-# LANGUAGE OverloadedStrings #-}

Modules

We import the Prelude hiding some functions, to avoid clashes: head, id and div are all HTML elements. Since we do not use the corresponding Prelude functions in our program, we will just hide them instead of qualifying either the Prelude or our modules.

> import Prelude hiding (head, id, div)

There are different HTML standards. For example,

Q2: What HTML standards should the library at least support?

Q3: Which HTML version would you preferably use?

For now, we have decided to use the HTML 4 Strict standard, as it seems to be the most used one.

Our goal is that a description of an HTML document using BlazeHtml looks as similar as possible to real HTML and, if possible, is even easier on the eyes. Hence, we want to provide for every HTML element and attribute a combinator with exactly the same name. However, this is not possible for two reasons: (1) There are HTML element and attribute names that conflict with Haskell keywords or Haskell naming conventions. (2) There are HTML elements that have the same name as HTML attributes.

To solve the first problem, we adopt the convention that the combinator for an HTML element (or an attribute) that conflicts with a Haskell keyword (like class) is suffixed with an underscore (i.e. class_ instead of class). Attributes like http-equiv (Haskell doesn’t like the ‘-’ character) will be written as http_equiv.
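The naming convention can be written down as a small function from HTML names to Haskell identifiers. This is only a sketch of the rule as stated above; combinatorName and haskellKeywords are hypothetical names and not part of the library.

```haskell
-- Reserved words of Haskell 2010.
haskellKeywords :: [String]
haskellKeywords =
    [ "case", "class", "data", "default", "deriving", "do", "else"
    , "foreign", "if", "import", "in", "infix", "infixl", "infixr"
    , "instance", "let", "module", "newtype", "of", "then", "type", "where"
    ]

-- Map an HTML element or attribute name to a valid combinator name:
-- keywords get an underscore suffix, and '-' becomes '_'.
combinatorName :: String -> String
combinatorName name
    | name `elem` haskellKeywords = name ++ "_"
    | otherwise                   = map dash name
  where
    dash '-' = '_'
    dash c   = c
```

Under this rule, class and type become class_ and type_, http-equiv becomes http_equiv, and names such as div are left unchanged.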

To solve the second problem, we split the combinators for elements and attributes into separate modules. This way the library user can decide on how to handle the conflicting names using hiding and/or qualified imports; e.g. we could qualify the attributes such that the ‘title’ attribute combinator becomes ‘A.title’.

Q4: What do you think of this approach for choosing combinator names?

Several HTML elements conflict with the Prelude; e.g. head or map. We are not sure how to resolve these clashes. Currently, we leave it up to the library user to use appropriate hiding and qualifying. This works fine if the library user separates the business logic from the presentation layer, and thus puts BlazeHtml templates in separate modules, where little logic is required. Another way is to also regard functions in the Prelude (or a bigger fixed set of libraries) as “Haskell keywords” and use underscore suffixing for name-conflict resolution.

Q5: Would you also regard the Prelude (or a bigger set of libraries) as fixed “Haskell keywords” and use underscore suffixing for conflict resolution?

For now, we have decided that all our modules will share the Text.Blaze prefix.

> import Text.Blaze.Html4.Strict hiding (map)
> import Text.Blaze.Html4.Strict.Attributes hiding (title)

Q6: Do you think Text.Blaze.X is a proper name for a module? Or should we drop Blaze and use Text.Html instead?

An advantage of using the Text.Html prefix is that the user can directly see what the module is meant for. A disadvantage is that the likelihood of module clashes on Hackage increases.

Two more imports to satisfy the compiler:

> import Data.Monoid (mconcat)
> import Control.Monad (forM_)

As you will see later, we will render our Html documents to UTF-8 encoded ByteStrings. For displaying these, we also need putStrLn from Data.ByteString.Lazy.

> import qualified Data.ByteString.Lazy as LB

Syntax

We will demonstrate our combinator language by example. This is the (simple) HTML document we want to produce:

<html>
    <head>
        <title>Introduction page.</title>
        <link href="screen.css" type="text/css" rel="stylesheet" />
    </head>
    <body>
        <div id="header">Syntax</div>
        <p>
            This is an example of BlazeHtml syntax.
        </p>
        <ul>
            <li>1</li>
            <li>2</li>
            <li>3</li>
        </ul>
    </body>
</html>

In BlazeHtml, we (ab)use do-notation to get a very light-weight syntax; i.e. monadic sequencing is used to represent concatenation of Html documents.

> page1 = html $ do
>     head $ do
>         title "Introduction page."
>         link ! rel "stylesheet" ! type_ "text/css" ! href "screen.css"
>     body $ do
>         div ! id "header" $ "Syntax"
>         p "This is an example of BlazeHtml syntax."
>         ul $ forM_ [1, 2, 3] (li . string . show)

This use has its cost, as we don’t support passing values inside the monad. Hence, return x >>= f != f x; that is, the left-identity monad law does not hold. We tried supporting passing values, but it cost too much performance.
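The following toy sketch shows how do-notation can stand for concatenation when no values are passed inside the monad. It is an illustration under our own assumptions, not the BlazeHtml implementation: HtmlM, text, render, viaDo and viaMonoid are hypothetical names, and the document is modelled as a plain difference list of characters.

```haskell
import Control.Monad (forM_)

-- A toy stand-in for an HTML type: a difference list of characters.
-- The type parameter is a phantom; no value of type 'a' is ever stored.
newtype HtmlM a = HtmlM (String -> String)

type Html = HtmlM ()

instance Semigroup (HtmlM a) where
    HtmlM f <> HtmlM g = HtmlM (f . g)

instance Monoid (HtmlM a) where
    mempty = HtmlM id

instance Functor HtmlM where
    fmap _ (HtmlM f) = HtmlM f

instance Applicative HtmlM where
    pure _ = HtmlM id
    HtmlM f <*> HtmlM g = HtmlM (f . g)

instance Monad HtmlM where
    -- There is no result to hand to 'k', so it receives a bottom value.
    -- This is why 'return x >>= f' differs from 'f x' when 'f' inspects x.
    HtmlM f >>= k =
        let HtmlM g = k (error "HtmlM: no value is passed inside the monad")
        in HtmlM (f . g)

text :: String -> Html
text s = HtmlM (s ++)

render :: HtmlM a -> String
render (HtmlM f) = f ""

-- The same fragment written with do-notation and with mconcat.
viaDo, viaMonoid :: Html
viaDo = do
    text "<ul>"
    forM_ [1, 2, 3 :: Int] (\n -> text ("<li>" ++ show n ++ "</li>"))
    text "</ul>"
viaMonoid = mconcat
    [ text "<ul>"
    , mconcat [text ("<li>" ++ show n ++ "</li>") | n <- [1, 2, 3 :: Int]]
    , text "</ul>"
    ]
```

Both definitions render to the same string, because sequencing with >> reduces to monoidal concatenation in this model. Any function that actually uses the value bound by >>= would hit the error call.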

The correct way out would be to drop this instance and have the user use the functions working on Monoids directly, as in the following example describing the same page:

> page2 = html $ mconcat
>     [ head $ mconcat
>         [ title "Introduction page."
>         , link ! rel "stylesheet" ! type_ "text/css" ! href "screen.css"
>         ]
>     , body $ mconcat
>         [ div ! id "header" $ "Syntax"
>         , p "This is an example of BlazeHtml syntax."
>         , ul $ mconcat $ map (li . string . show) [1, 2, 3]
>         ]
>     ]

The syntax choice is up to the end user. We tend to prefer the first notation, as we think it is more light-weight.

The main function just outputs the two pages:

> main = do
>     LB.putStrLn $ renderHtml page1
>     LB.putStrLn $ renderHtml page2

Q7: What do you think about this syntax, generally?

Q8: Do you think ! is a good operator for setting attributes?

We made an initial choice for ! because the old HTML package uses this. However, this operator looks more like array indexing. It is not too late to change this; suggestions are very welcome.

Q9: How should multiple attributes be handled?

In the above example, we used the ! again for the next attribute:

link ! rel "stylesheet" ! type_ "text/css" ! href "screen.css"

Another option would be to define a variant that takes a list of attributes:

link !> [rel "stylesheet", type_ "text/css", href "screen.css"]

Or, we could use a type class to give the ! different uses, and thus have:

link ! [rel "stylesheet", type_ "text/css", href "screen.css"]

The last option will, however, introduce a more complicated type for attributes, more complicated type errors, and a performance overhead in some cases.
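To compare the first two options, here is a minimal sketch in which the list variant !> is simply a left fold of the single-attribute operator !. The types Element and Attribute and all definitions below are toy stand-ins of our own, not BlazeHtml's types.

```haskell
-- Hypothetical toy types; BlazeHtml's real types differ.
data Attribute = Attribute String String deriving (Eq, Show)
data Element = Element String [Attribute] deriving (Eq, Show)

-- Set a single attribute.
(!) :: Element -> Attribute -> Element
(!) (Element tag attrs) attr = Element tag (attrs ++ [attr])

-- Set a list of attributes by folding the single-attribute operator.
(!>) :: Element -> [Attribute] -> Element
(!>) = foldl (!)

link :: Element
link = Element "link" []

rel, type_, href :: String -> Attribute
rel   = Attribute "rel"
type_ = Attribute "type"
href  = Attribute "href"

-- Both spellings build the same element.
chained, listed :: Element
chained = link ! rel "stylesheet" ! type_ "text/css" ! href "screen.css"
listed  = link !> [rel "stylesheet", type_ "text/css", href "screen.css"]
```

In this model the two notations are interchangeable, so the choice between them is mostly a matter of taste. The type class variant is not sketched here, since its cost depends on how attributes are represented in the real library.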

Rendering & encoding

As said before, BlazeHtml supports strings represented either as String or Data.Text values. These two types support all Unicode codepoints, so the encoding format should support all Unicode codepoints, too. If the encoding format does not support all Unicode codepoints, then rendering and encoding cannot be separated nicely because unencodable characters must already be escaped accordingly during rendering.
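As an illustration of why a lossy encoding entangles rendering and encoding, consider Latin-1: codepoints above 255 cannot be encoded, so the renderer has to replace them by numeric character references before the encoder runs. The function escapeForLatin1 below is a hypothetical helper written for this example only.

```haskell
import Data.Char (ord)

-- Replace every codepoint that Latin-1 cannot represent by a numeric
-- character reference, so that a later Latin-1 encoder never fails.
escapeForLatin1 :: String -> String
escapeForLatin1 = concatMap esc
  where
    esc c
        | ord c > 255 = "&#" ++ show (ord c) ++ ";"
        | otherwise   = [c]
```

With UTF-8 no such step is needed, because every codepoint has a byte representation.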

We think that more than 95% of the end users won’t need support for lossy encodings. Hence, we choose not to support them. Note that all desktop browsers and most mobile browsers support superior encodings.

Q10: Do you need support for “lossy” encodings, e.g. Latin-1? If yes, could you describe your use case more precisely?

Fixing the encoding statically greatly helps in achieving the best possible performance. Hence, we fix the encoding of rendered HTML documents to UTF-8 because this is the most used encoding for HTML documents.

Q11: What other encodings do you need support for?

Note that for a non-performance-critical code path you can always decode and re-encode the rendered and UTF-8 encoded HTML document.
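Such a re-encoding step could look like the sketch below, which uses Data.Text.Lazy.Encoding from the text package. The names utf8ToUtf16LE and example are ours, and UTF-16LE is just one possible target chosen for illustration.

```haskell
import qualified Data.ByteString.Lazy as LB
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Decode UTF-8 bytes and re-encode them as UTF-16 (little endian).
-- Note that decodeUtf8 throws an exception on invalid UTF-8 input.
utf8ToUtf16LE :: LB.ByteString -> LB.ByteString
utf8ToUtf16LE = TLE.encodeUtf16LE . TLE.decodeUtf8

-- A stand-in for a rendered document: UTF-8 bytes of a small fragment.
example :: LB.ByteString
example = utf8ToUtf16LE (TLE.encodeUtf8 (TL.pack "<p>caf\233</p>"))
```

This costs an extra pass over the whole document, which is why it is only suitable outside the performance-critical path.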

Speed

Possibly, you think that the best representation for a rendered HTML document is a Text value. Then, one could use the functions from Data.Text to encode this Text value to the desired final encoding. However, this conflicts with the goal of being as efficient as possible. For maximal efficiency, one wants to do as little work as possible for each byte that is output.

Hence, if we convert our data directly to the final encoding, then we save one intermediate representation. In our current prototype implementation, we therefore build the UTF-8 encoded sequence of bytes directly using a slightly modified version of the Builder monoid from Data.Binary. This has the nice side effect that the lazy ByteString generated by the Builder monoid consists of a list of big (32 kB) chunks that can be sent over the network efficiently using the network-bytestring library.
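The idea of writing escaped, UTF-8 encoded bytes without an intermediate string can be sketched with a builder monoid. Note the assumption: the sketch uses Data.ByteString.Builder from the bytestring package, not the modified Data.Binary Builder of the prototype, and escapeChar, textNode and renderUtf8 are hypothetical names.

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as LB

-- Escape one character and emit its UTF-8 bytes in a single step.
escapeChar :: Char -> B.Builder
escapeChar '<' = B.string7 "&lt;"
escapeChar '>' = B.string7 "&gt;"
escapeChar '&' = B.string7 "&amp;"
escapeChar '"' = B.string7 "&quot;"
escapeChar c   = B.charUtf8 c

-- A text leaf: no intermediate String or Text of escaped content is built.
textNode :: String -> B.Builder
textNode = foldMap escapeChar

-- Run the builder; the result is a lazy ByteString made of chunks.
renderUtf8 :: B.Builder -> LB.ByteString
renderUtf8 = B.toLazyByteString
```

Because builders form a monoid, larger documents are assembled with <> and mconcat, and the bytes are only written into buffers when the builder is run.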

Our preliminary benchmark suite shows that this is a very promising approach. You can run these benchmarks by calling

make bench-html

in the BlazeHtml directory that you created at the beginning of this RFC.

Note that these benchmarks also contain the “BigTable” benchmark that is implemented in many different templating engines. It measures the rendering time of a big <table> that has 1000 rows and 10 columns, and every row has the simple content 1, 2, 3, … 10. Our prototype library is much faster than other templating engines such as Spitfire, ClearSilver, ERB and Erubis. More information can be found in this blogpost.
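For readers unfamiliar with the “BigTable” benchmark, the shape of the generated document can be sketched as follows. This is not the benchmark code from the repository: it is a minimal stand-in using Data.ByteString.Builder from the bytestring package, with hypothetical names row, bigTable and bigTableBytes.

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as LB

-- One table row with one cell per number.
row :: [Int] -> B.Builder
row cells = B.string7 "<tr>" <> foldMap cell cells <> B.string7 "</tr>"
  where
    cell n = B.string7 "<td>" <> B.intDec n <> B.string7 "</td>"

-- A table with the given number of rows, each holding 1 to 10.
bigTable :: Int -> B.Builder
bigTable rows =
    B.string7 "<table>"
        <> mconcat (replicate rows (row [1 .. 10]))
        <> B.string7 "</table>"

-- The benchmark-sized table: 1000 rows and 10 columns.
bigTableBytes :: LB.ByteString
bigTableBytes = B.toLazyByteString (bigTable 1000)
```

Timing the evaluation of bigTableBytes, for example with a benchmarking library of your choice, gives a rough feeling for the workload, but it says nothing about the numbers reported for BlazeHtml itself.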

Q12: Do you know of other libraries or benchmarks that we should compare to?

Epilogue

Most modern web applications embrace the MVC design pattern. In this pattern, BlazeHtml is part of the “View”. Two other components are needed – the “Model” (data retrieval & persistence) and the “Controller” (the server).

Q13: What other libraries would you use BlazeHtml with?

Q14: Do you see any problems with respect to integrating BlazeHtml in your favourite web-framework/server?

Looking forward to your feedback

The easiest way to send feedback is to reply by email to the haskell-cafe thread. Alternatively, drop a comment at reddit.

Jasper van der Jeugt and Simon Meier
