Introduction
I have been working on ES.hs for a month or two. ES.hs is a compiler that compiles Haskell to ES6 code. There are already a few options out there: there is GHCJS, which takes over after the LLVM phase. This renders communication with Javascript not very straightforward but in my opinion has high potential in WebAssembly. There also is Haste, which seems to carry a huge chunk of runtime with the code it generates. Different from them, ES.hs uses the GHC frontend and a part of the backend as a library until after STG, which keeps the naming structure and takes the form of simply typed lambda calculus. Therefore, the code EH.hs generates interpolate nicely with native Javascript: for example a Haskell function Data.List.length
simply becomes Data/List.js :: function length()
. Another advantage is that the generated code does not carry a runtime because ES6’s lambda facility is quite usable, at least usable to the extent that it maps nicely to STG.
However, I have found that code generation in Javascript can hardly be called “difficult” compared to leveraging the famous Glasgow Haskell Compiler. Every hour I spent on the code generation part of ES.hs, I spent at least double the time trying to understand the dozen different types of intermediate results GHC produces. Fortunately, the subreddit r/haskell and freenode #ghc provided me with tons of help. A big thank-you goes to them, and I am now starting this series of blogs to write down what I learned about GHC so that people could suffer less in the future. GHC 8.6.1 is the latest stable version and the one I am using. Therefore, things in this blog should apply to 8.6.1 but may not apply to a later or former version of GHC, as its API is constantly changing.
GHC Core
How GHC compiles one single module in a high-level view is moderately explained in the GHC development commentary of HscMain. However, I think a better way to approach GHC is to understand how it represents simple Haskell elements internally. The Core language is one of GHC’s central data types. Essentially, it is type-checked, desugared, and renamed, and it carries all the essential information from the source code. If you ever wrote Haskell, you must have fought with the type-checker and cursed it with the most ferocious language you can think of: this is what the type checking phase does. Desugaring is a little bit more interesting. There are a lot of syntactic-sugar constructions in Haskell: pattern-guards, do-notation, etc. What the desugaring phase does is first to evaluate Template Haskell, then to convert all of these fancy syntaxes into the very concise but equally expressive Core language. Lastly, the renaming phase does one simple but important thing: it analyzes variable scopes and renames them so that no two variables share the same name. In a post renaming world, we don’t need to worry about shadowing or variable scope in general anymore.
Let’s now look into the Core language, whose definition lives inside the CoreSyn
module. If you click on the link to its document, you should notice right-away the Expr b
data type, which represents an expression in Core. This is very important because everything (except type-level stuff which we’ll touch on later) in Haskell is an expression, which is just something that has a value in its context.
Let’s go through each of Expr b
’s constructors, but ignore the b
type argument for now.
-
Var Id
A variable with an identifier, such as the
y
inx = y
. Becausex
in this context does not have a value, it is not an expression. We will talk more about this later. -
Lit Literal
A literal, such as the
42
inx = 42
. -
App (Expr b) (Arg b)
A function application, such as
length xs
, wherelength
is theExpr b
that got applied on (which is itself aVar
in this case), andxs
is the argumentArg b
. Remember that every function is curried so we have only one argument here. The typeArg b
is just a synonym forExpr b
. -
Lam b (Expr b)
A function construction (or lambda). For example,
\x = x + 1
is a lambda wherex
is theb
which represents the argument (again, there is only one because of currying), and the bodyx+1
is represented by theExpr b
part. Notice howx
is not an expression as it doesn’t have a value in its context. -
Let (Bind b) (Expr b)
A let binding with its body. Yes, a let binding isn’t just syntactic sugar but a full-blown expression as well! Consider this infinite list of ones,
let x = 1:x in tail x
. If a let binding is just syntactic sugar that gets eliminated/desugared during compilation, the compiler would have to fully generate and evaluate the list[1, 1, 1, …]
and assign it tox
. This is impossible due to the infinite nature of the list 1. Therefore, a let binding is very different from a#define
in C and as we will see a very integrated part of the Haskell language. Back to the example, thex = 1:x
part of the let expression is represented by theBind b
part of the constructor, and thetail x
part is represented by theExpr b
in the second part. But what exactly isBind b
?b
: Firstly, we’ll have to explain whatb
is. Essentially,b
is the type of the things we bind values on and it may vary across different GHC phases. In the Core phase, for example,b
==Id
, which is an identifier with type, kind and category 2 information.Bind b
: If we check the document, we can see thatBind b
has two data constructors. There isNonRec b (Expr b)
andRec [(b, Expr b)]
. Therefore, intuitive enough, aNonRec b (Expr b)
represents a non-recursive binding of theExpr b
onto theb
, and aRec [(b, Expr b)]
represents a set of recursive bindings in a similar fashion. Notice that our example would be represented by aRec
that looks likeRec [(x, 1:x)]
because it is self-recursive.
-
Case (Expr b) b Type [Alt b]
A general pattern matching. Consider this example of pattern matching using case on simple integers:
case x + 1 of y 1 -> 11 2 -> y + 3 _ -> 42
which would roughly translate to
Case (x + 3) y Int {1: 100, 2: y + 3, default: 42}
. You may ask: what is they
doing here? This is actually a feature of GHC that isn’t seen very much. In the case body,y
is bound to the value ofx + 1
. Therefore, whenx == 1
, the case expression returns5
. With that explained, the first argument of aCase
is simply the thing we are pattern matching on; the second argument is they
which we formly called the binder; the third argument is the type of the expression (in this caseInt
or strictly speakingIntegral a => a
); and the fourth argument is a list of the “alternatives”, or the cases.Alt b
: represents a case. It is the type synonym for the tuple type(AltCon, [b], Expr b)
. To explain this, another example is in order:case maybeValue of Just x -> x + 2 _ -> 5
- The first
AltCon
value of the tuple represents the construction of the case. It has three data constructors:DataAlt DataCon
,LitAlt Literal
, andDEFAULT
. ADataCon
represents a data constructor, so in the first case ofJust x
, theAltCon
value isDataAlt Just
. In our previous example, theAltCon
value is quite intuitivelyLitAlt 1/2
in the 1/2 cases. AndDEFAULT
is just theAltCon
for_
cases. - The second
[b]
value is the binding variables for the constructor. In theJust x
case, for example, we have[b]
==[x]
. - The third
Expr b
value is the expression following the pattern matching. In theJust x
case, it isx + 2
.
- The first
Notice that unlike
Let
, we don’t have a construction for recursive bindings. This is because you can’t do recursion in a pattern matching. Try it! Also notice that we are pattern matching against one construction at a time. What about composite patterns such asJust (Right 5)
? We don’t need to worry about this because the desugarizer would desugar it into two separate pattern matchings. -
Cast (Expr b) Coercion
Because in the Core phase all type information is preserved, we still need to do type coercions as specified in System FC. Ignore this for now if you don’t know what it is.
-
Tick (Tickish Id) (Expr b)
A tick is an attached piece of information on an expression for optimization purposes. Removing or adding ticks shouldn’t change the behavior of a program. I haven’t done much research about it and ignore all ticks in ES.hs.
-
Type Type
A type value. Because the Core language is in System FC, types are explicitly passed around as regular values. Therefore, we need to be able to represent them as expressions.
-
Coercion Coercion
Ditto.
So this is the entirety of the GHC Core language! Later on we should explain how Type
works, and specifically work on DataCon
and TyCon
which represent data constructors and type constructors.
-
Whether it is infinite can’t be determined because of the Halting problem. ↩
-
I made this one up as I don’t have a proper term for it. Basically it is the information encoded by the
IdInfo
data type. ↩