Space Efficiency in Implementation.

Nhc98 generates code that uses a modified G-machine to evaluate Haskell
programs.  Most of the implementation is taken from Simon Peyton Jones'
book "The Implementation of Functional Programming Languages" (1987).

It was clear from the beginning that space would be a problem for the
compiler when trying to recompile itself on a 4Mb machine (of which 1Mb
was used by screen memory and operating system!)  Space usage can be
subdivided into three groups:

  * code and static data
  * heap
  * stacks


Code and Static Data.

There are two ways to decrease the size of nhc98's code:
  * generate smaller code for a given source file
  * write a smaller compiler e.g. less than 20,000 lines of source code.
Nhc98 tries to do both, and currently produces a binary for nhc98comp
which is less than 1Mb.

Generate Smaller Code.

It is possible to generate a small binary by eliminating common
subexpressions and inserting clever abstractions which makes it possible
to combine code from different parts of the source.  Unfortunately,
these transformations use heap space, and were therefore abandoned early
on.

A "cheap" way to reduce the size of the generated code is to use
bytecode instead of machine code.  This means that nhc98 must include a
small interpreter in all generated programs, but this is not a huge
cost.  The saving in space must however be paid in speed.  A bytecode
interpreter make a lot of jumps whose destinations are decided very
late, and this is bad behaviour for modern processors which use heavy
pipelining and branch prediction.  The gain in space is however quite
substantial as most instructions fit in 2 or less bytes, which is much
better than the 4 bytes/instruction for normal RISC machine code.  The
gain can in many cases be better than the 50-75% gain from smaller
instructions, as the bytecode instruction set can be optimised for graph
reduction so that fewer instructions are needed for a given Haskell
function.  This optimisation is not done fully in nhc98 yet however.

Less Source Code.

Nhc98 has nearly no optimisations; it consists only of the necessary
parts in a Haskell compiler for the G-machine.  The total source code
for nhc98comp is 19,000 lines of Haskell, with 15,000 lines of C for the
runtime system, and 9,000 lines for the Standard Prelude and Libraries.

The Haskell source for nhc98comp can be divided into the following
parts:

  * Lexical - handwritten tokeniser.
    All identifiers are hashed when read sp that equal identifiers can
    share the same string in the heap.  Part of the cost of hashing all
    identifiers is paid back in the renaming phase, as all comparisons of
    identifiers only need to compare integers instead of strings.

  * Parser - handwritten parser.
    The parser is written with parser combinators utilising a
    continuation monad that allows backtracking but has an efficient
    commit combinator that gets rid of all the data that is only needed
    for backtracking.  The commit combinator is essential when parsing
    with backtracking, as we otherwise introduce a space leak starting
    with the first position of the source code where backtracking is
    possible.

  * Renaming - gives all identifiers unique names.
    All different identifiers are replaced with unique integers.  The
    integers are then used as indexes into a balanced tree holding the
    symbol table.  Another possibility would have been to use sharing,
    so that all instances of an identifier point to the same shared
    data.  In the latter case all information about an identifier must
    be known when renaming.  This can be solved by lazy evaluation and
    some circular data structures.  Unfortunately lazy evaluation can
    sometimes hold onto a lot of heap space which can be difficult to
    get rid of, especially with circular dependencies.  A decision to
    use as few circular data structures as possible was therefore taken
    for nhc98.

  * Scc - find srongly connected groups (needed for typechecking).

  * Type - type checking and insertion of dictionaries.
    Here circular data structures are used to insert dictionaries as no
    other method was known that didn't either have a nasty complexity
    (exponential in the number of nested let-bindings) or holds onto a
    lot of information before it could insert all dictionaries with an
    extra pass over the syntax tree.

  * Export - creates the interface file.

  * Case - simplifies pattern bindings and pattern matching.
    The removal of patterns is done without code duplication.  This
    means that some programs might test the same pattern more than once,
    but it saves space.

  * Lift - lambda lifting.
    This is a simple lambda lifter and doesn't use fully-lazy lambda
    lifting.  It only does what is necessary for the G-machine which
    can't evaluate code with free variables.

  * Code - intermediate code for some small optimisations.
    All applications with known functions are marked as vector
    applications or constant expressions if the arity is greater than
    the number of available arguments.  A vector application uses less
    space than a chain of binary apply nodes, the latter using at least
    2 pointers for every argument compared to one pointer/argument for
    vector applications.

  * G-code - generates bytecode (with a few peep-hole optimisations).


Heap.

The possibility to use a small heap depends mostly on the amount of data
that is live at a given moment, but if we want the program to finish in
a reasonable time, then also the rate that free heap is used is of
interest.  Nhc98 uses the following implementation choices to reduce the
amount of live data:

  * compact graph representation
    When designing the node layout, size was the main consideration (see
    Fig 1).  The lack of a dedicated application means that higher-order
    functions cost more (3 words per argument) than if a two-word
    application node were defined.  Profiling has shown that nhc98 uses
    a lot of vector applications with the apply function, over 20% of
    the heap at some times, so something should be done here.

  * pattern binding updates (Sparud 93)
    When a variable in a left-hand pattern is used then all variables in
    the pattern are set to point to their parts of the expression.  This
    means that the node representing the expression that we matched
    against can be released as soon as the first variable in the pattern
    is used instead of waiting for the last one.

  * indirection
    No updates are done by copying; instead the reduced node is
    overwritten with a pointer to the result.  If a function creates the
    result node then overwriting is used if the result is smaller or
    equal to the redex.  Copying does not cause any extra evaluation if
    the result node is evaluated before being copied, but we night lose
    sharing of the answer.  This was one of many reasons for the space
    leak discovered in RuncimanWakeling93.

  * tail calls
    If the last expression of a function is a call to a function then
    some compilers (e.g. hbc) rearrange the stack and jump to the new
    function. This might lead to space leaks as the node representing
    the old application hasn't been overwritten yet and therefore might
    hold on to unnecessary data.  In hbc this can be solved by using the
    option to zap redex, which overwrites all application nodes with a
    dummy node at the start of their evaluation.  Nhc98 instead tries to
    build the new application on top of the old application, rearranging
    the stack and then jumping to the function in the new application.
    If the new application is larger than the old application, then an
    indirection node is used to overwrite the old application before
    jumping to the new function.

  * sharing of zero-arity constructors
    Characters and all user-defined constructors with zero arity are
    allocated once outside the heap, i.e. all uses of [] or a character
    use the name node.  It is not possible to do the same for Int and
    Float, but small ints have a table outside the heap.


Stacks.

The only thing currently done for reducing the stack usage is that
arguments to functions aren't copied to the stack before reduction.  As
all reductions are on vector application nodes (VAP nodes) it's quite
cheap to fetch the arguments from the VAP nodes whenever they are needed.

Another difference between nhc98 and the G-machine in SPJ87 is that
nhc98's G-machine only uses one stack (see Fig 2).  The stack is
organised into frames according to a sketch on Lennart's whiteboard.  It
makes garbage collection slightly slower because of the need to keep
track of which values on the stack are pointers.  A nice property is
that it is easy to decide which functions are responsible for which
pointers on the stack, which is needed from some heap profiles (see
below).

It is possible to reduce the size of a frame if we use the following
observations:

  * Always true: vapptr points to the same node as the stack cell below
    fp'.

  * Mostly true: it is possible to get the constptr from the VAP node
    pointed at by vapptr.  The problem here is that pat-bind-update might
    overwrite the VAP node.  This can be solved by emitting code so that
    the VAP nodes that can be overwritten by pat-bind-update always are
    of the same kind.

It is also possible to remove the stack descriptor from the code if we
only allow pointers on the stack; this does however prevent the
possibility of using unboxed values.


Profiling.

An unplanned diversion in the development of the original nhc compiler
was the inclusion of heap profiling.  Heap profiling was originally
developed in hbc, but there are some extra possibilities when nhc98 is
used.