Notes on GHC (1) - The Core Language

Introduction

I have been working on ES.hs for a month or two. ES.hs is a compiler that compiles Haskell to ES6 code. There are already a few options out there: there is GHCJS, which takes over after the LLVM phase. This renders communication with Javascript not very straightforward but in my opinion has high potential in WebAssembly. There also is Haste, which seems to carry a huge chunk of runtime with the code it generates. Different from them, ES.hs uses the GHC frontend and a part of the backend as a library until after STG, which keeps the naming structure and takes the form of simply typed lambda calculus. Therefore, the code EH.hs generates interpolate nicely with native Javascript: for example a Haskell function Data.List.length simply becomes Data/List.js :: function length(). Another advantage is that the generated code does not carry a runtime because ES6’s lambda facility is quite usable, at least usable to the extent that it maps nicely to STG.

However, I have found that code generation in Javascript can hardly be called “difficult” compared to leveraging the famous Glasgow Haskell Compiler. Every hour I spent on the code generation part of ES.hs, I spent at least double the time trying to understand the dozen different types of intermediate results GHC produces. Fortunately, the subreddit r/haskell and freenode #ghc provided me with tons of help. A big thank-you goes to them, and I am now starting this series of blogs to write down what I learned about GHC so that people could suffer less in the future. GHC 8.6.1 is the latest stable version and the one I am using. Therefore, things in this blog should apply to 8.6.1 but may not apply to a later or former version of GHC, as its API is constantly changing.

GHC Core

How GHC compiles one single module in a high-level view is moderately explained in the GHC development commentary of HscMain. However, I think a better way to approach GHC is to understand how it represents simple Haskell elements internally. The Core language is one of GHC’s central data types. Essentially, it is type-checked, desugared, and renamed, and it carries all the essential information from the source code. If you ever wrote Haskell, you must have fought with the type-checker and cursed it with the most ferocious language you can think of: this is what the type checking phase does. Desugaring is a little bit more interesting. There are a lot of syntactic-sugar constructions in Haskell: pattern-guards, do-notation, etc. What the desugaring phase does is first to evaluate Template Haskell, then to convert all of these fancy syntaxes into the very concise but equally expressive Core language. Lastly, the renaming phase does one simple but important thing: it analyzes variable scopes and renames them so that no two variables share the same name. In a post renaming world, we don’t need to worry about shadowing or variable scope in general anymore.

Let’s now look into the Core language, whose definition lives inside the CoreSyn module. If you click on the link to its document, you should notice right-away the Expr b data type, which represents an expression in Core. This is very important because everything (except type-level stuff which we’ll touch on later) in Haskell is an expression, which is just something that has a value in its context.

Let’s go through each of Expr b’s constructors, but ignore the b type argument for now.

Var Id

A variable with an identifier, such as the y in x = y. Because x in this context does not have a value, it is not an expression. We will talk more about this later.
Lit Literal

A literal, such as the 42 in x = 42.
App (Expr b) (Arg b)

A function application, such as length xs, where length is the Expr b that got applied on (which is itself a Var in this case), and xs is the argument Arg b. Remember that every function is curried so we have only one argument here. The type Arg b is just a synonym for Expr b.
Lam b (Expr b)

A function construction (or lambda). For example, \x = x + 1 is a lambda where x is the b which represents the argument (again, there is only one because of currying), and the body x+1 is represented by the Expr b part. Notice how x is not an expression as it doesn’t have a value in its context.
Let (Bind b) (Expr b)

A let binding with its body. Yes, a let binding isn’t just syntactic sugar but a full-blown expression as well! Consider this infinite list of ones, let x = 1:x in tail x. If a let binding is just syntactic sugar that gets eliminated/desugared during compilation, the compiler would have to fully generate and evaluate the list [1, 1, 1, …] and assign it to x. This is impossible due to the infinite nature of the list ¹. Therefore, a let binding is very different from a #define in C and as we will see a very integrated part of the Haskell language. Back to the example, the x = 1:x part of the let expression is represented by the Bind b part of the constructor, and the tail x part is represented by the Expr b in the second part. But what exactly is Bind b?
- b: Firstly, we’ll have to explain what b is. Essentially, b is the type of the things we bind values on and it may vary across different GHC phases. In the Core phase, for example, b == Id, which is an identifier with type, kind and category ² information.
- Bind b: If we check the document, we can see that Bind b has two data constructors. There is NonRec b (Expr b) and Rec [(b, Expr b)]. Therefore, intuitive enough, a NonRec b (Expr b) represents a non-recursive binding of the Expr b onto the b, and a Rec [(b, Expr b)] represents a set of recursive bindings in a similar fashion. Notice that our example would be represented by a Rec that looks like Rec [(x, 1:x)] because it is self-recursive.
Case (Expr b) b Type [Alt b]

A general pattern matching. Consider this example of pattern matching using case on simple integers:
```
      case x + 1 of y
          1 -> 11
          2 -> y + 3
          _ -> 42
```
which would roughly translate to Case (x + 3) y Int {1: 100, 2: y + 3, default: 42}. You may ask: what is the y doing here? This is actually a feature of GHC that isn’t seen very much. In the case body, y is bound to the value of x + 1. Therefore, when x == 1, the case expression returns 5. With that explained, the first argument of a Case is simply the thing we are pattern matching on; the second argument is the y which we formly called the binder; the third argument is the type of the expression (in this case Int or strictly speaking Integral a => a); and the fourth argument is a list of the “alternatives”, or the cases.
- Alt b: represents a case. It is the type synonym for the tuple type (AltCon, [b], Expr b). To explain this, another example is in order:
```
      case maybeValue of
          Just x -> x + 2
          _ -> 5
```
  - The first AltCon value of the tuple represents the construction of the case. It has three data constructors: DataAlt DataCon, LitAlt Literal, and DEFAULT. A DataCon represents a data constructor, so in the first case of Just x, the AltCon value is DataAlt Just. In our previous example, the AltCon value is quite intuitively LitAlt 1/2 in the 1/2 cases. And DEFAULT is just the AltCon for _ cases.
  - The second [b] value is the binding variables for the constructor. In the Just x case, for example, we have [b] == [x].
  - The third Expr b value is the expression following the pattern matching. In the Just x case, it is x + 2.
Notice that unlike Let, we don’t have a construction for recursive bindings. This is because you can’t do recursion in a pattern matching. Try it! Also notice that we are pattern matching against one construction at a time. What about composite patterns such as Just (Right 5)? We don’t need to worry about this because the desugarizer would desugar it into two separate pattern matchings.
Cast (Expr b) Coercion

Because in the Core phase all type information is preserved, we still need to do type coercions as specified in System F_C. Ignore this for now if you don’t know what it is.
Tick (Tickish Id) (Expr b)

A tick is an attached piece of information on an expression for optimization purposes. Removing or adding ticks shouldn’t change the behavior of a program. I haven’t done much research about it and ignore all ticks in ES.hs.
Type Type

A type value. Because the Core language is in System F_C, types are explicitly passed around as regular values. Therefore, we need to be able to represent them as expressions.
Coercion Coercion

Ditto.

So this is the entirety of the GHC Core language! Later on we should explain how Type works, and specifically work on DataCon and TyCon which represent data constructors and type constructors.

Whether it is infinite can’t be determined because of the Halting problem. ↩
I made this one up as I don’t have a proper term for it. Basically it is the information encoded by the IdInfo data type. ↩

Notes on GHC (1) - The Core Language

Introduction

GHC Core

`Var Id`

`Lit Literal`

`App (Expr b) (Arg b)`

`Lam b (Expr b)`

`Let (Bind b) (Expr b)`

`Case (Expr b) b Type [Alt b]`

`Cast (Expr b) Coercion`

`Tick (Tickish Id) (Expr b)`

`Type Type`

`Coercion Coercion`