Chapter 11
by Arpon Sarker
The Elements of Computing Systems Chapter 11 - Compiler II: Code Generation
Introduction
In this chapter, after parsing and understanding the validity and structure of the input text given as high-level code, the compiler now sets upon the task of generating the corresponding code into VM code as many of the low-level concerns such as memory management was handled in the prior VM chapters. Modern high-level programming languages are rich with features such as abstractions (of objects and functions), flow of control statements, and building data structures of unlimited complexity.
Symbol Table
The compiler itself must map variable’s kind (static, field, argument, local) and type (int, String, boolean) where kind implicitly manages the variable’s life cycle and scope, index - the index represents each unique kind, numbering up from 0. If a new identifier is encountered, it is added with its corresponding scope and if already in the table, the compiler looks it up and replaces it, using its kind as the memory segment in VM code and may use its index. For scopes, the compilor looks in the local scope first and if not found will look up to larger scopes to find the variable globally. This lends itself to the use of a list of hash tables.
Handling Variables
Mapping various types of variables in the source program onto the target memory is not trivial: different types of variables require different sizes of memory chunks (not implemented here), different kinds of variables have different life cycles (static vs field and recursion for each subroutine’s scope). Luckily, this was all handled in the VM implementation. For any language L, any L-to-VM compiler is completely relieved from low-level memory management.
Handling Arrays
Arrays are stored as sequences of consecutive memory locations (even multi-dimensional arrays are flattened to 1-D). The array name is treated as the pointer to the base address of the RAM allocated to store the array in memory. Dynamic memory allocation is done from the heap, where when the array is declared, it results in the allocation of a single pointer which points to the array’s base address, if constructed at run-time. Pointer arithmetic is used adding the base address (from the pointer) and the index as an offset.
Handling Objects
Object instances encapsulate its class fields as well as a set of operations. The low-level handling of the object data is similar to arrays where we store fields in consecutive memory locations. Similarly, when an object instance is declared, the compiler only allocates a pointer variable and the actual object is created by a call to the class constructor. The data encapsulated is just accessed linearly using an index relative to the base address.
In order to apply a method to some object b, we just do b.mult(5) since this method only has one copy stored unlike the data fields but must look as if this single method is encapsulated on its own instance. Hence, we pass a reference to the manipulated object as a hidden argument as seen above, translated from mult(b,5). So we must push b then 5 and call mult. The compiler must also ensure that different methods found in different classes with the same name are applied to the right object. In Jack, which is a compile-time langauge, the compiler must ensure the method belongs to the class in which it is being derived. For run-time languages that support run-time typing it can be done while the program is running.
Commands Translation
The parse tree was already created from the syntax analyzer but we can traverse the tree in post-order traversal and use postfix notation or Right Polish Notation (RPN) for printing out the operands first and then the operator last. For example, f(x,y) becomes x, y, f which becomes push x, push y, call f.
For translating flow control, we are only implementing if and while which only rely on the low-level primitives of a conditional go-to and unconditional go-to. A program that contains multiple if and while statements requires unique label names to differentiate between the loops and nesting should also be allowed which can be done by using a recursive strategy.
Implementation
The implementation for the entire compiler consists of a top-level driver, tokenizer, symbol table, output module for generating VM code, and a recursive top-down compilation engine. The output module is called by the compilation engine to output its corresponding VM code but the compilation engine itself goes through the parsed code and figures out what should be generated.
Perspective
The extensions that can be made to Jack which utilised sidestepped more complex issues for compilation ease is that all of Jack’s data types are 16-bit long and the language semantics (“var”) allows the compiler to ignore the information, array entries are not typed, does not support inheritance, and does not support public class fields, no for and switch statements (simple), assigning constants to type ‘char’ (presently, “c” must be assigned to String and then char), and no optimisation.
Experience
Unfortunately for this chapter, I could not complete it due to strict limitations (vacation + Christmas) on the 19th of December. However, I did get quite close but looking back, I honestly just wrote spaghetti and didn’t want to remove the XML output as I would lose what place I’m at in the code. If I had more time, I would’ve tried to understand and internalise all the the little idiosyncratic tidbits such as function names having to be of “class_name.function_name” format, unique label names (I made them all unique and am not sure if I was supposed to do that) and I did not use any helper functions to cut down the code which would’ve been very helpful for the 1100+ lines of code and the obvious massive amounts of repetition I ignored. I was honestly stumped when it came to how I would generate the corresponding VM code but after looking at someone else’s code I finally understood what to do.
tags: compsci - computer_architecture - programming